Abstract

We analyze the convergence properties of the Wang-Landau algorithm. This sampling 
method belongs to the general class of adaptive importance sampling strategies which use 
the free energy along a chosen reaction coordinate as a bias. Such algorithms are very 
helpful to enhance the sampling properties of Markov Chain Monte Carlo algorithms, 
when the dynamics is metastable. We prove that the Wang-Landau algorithm converges
with an associated central limit theorem, and we provide an analysis of the efficiency of 
the algorithm in a metastable situation. 

1 Introduction 

The Wang-Landau algorithm was originally proposed in the physics literature to efficiently
sample the density of states of Ising-type systems [30, 31]. From a computational statistics
viewpoint, it can be seen as an adaptive importance sampling strategy combined with a
Metropolis algorithm: the instrumental distribution is updated at each iteration of the
algorithm in order to sample the configuration space as uniformly as possible along a
given direction. Numerous physical and biochemical works use this technique to
overcome sampling problems such as those encountered in the computation of macroscopic
properties around critical points and phase transitions, or in the sampling of folding
mechanisms for proteins. The success of the technique motivated its use and study in the
statistics literature; see for instance [22], [23], [2], [15], E| for previous mathematical and
numerical studies.



1.1 Free energy biasing techniques 



The Wang-Landau algorithm belongs to the class of free energy biasing techniques [19], which
have been introduced in computational statistical physics to efficiently sample thermodynamic
ensembles and to compute free energy differences. These algorithms can be seen as adaptive
importance sampling techniques, the biasing factor being adapted on the fly in order to flatten
the target probability measure along a given direction. Let us explain this in more detail.

Let $\pi$ be a multimodal probability measure over a high-dimensional space $\mathsf{X} \subset \mathbb{R}^D$.
Classical algorithms to sample $\pi$ (such as a Metropolis-Hastings procedure with local proposal
moves) typically converge very slowly to equilibrium since high probability regions are
separated by low probability regions. Averages have to be taken over very long trajectories in order
to visit all the modes of the target probability measure $\pi$. The idea of free energy biasing
techniques is to flatten the target probability along a well-chosen direction through an
importance sampling procedure in order to more easily sample $\pi$. More precisely, assume that we
are given a measurable function $O$ defined on $\mathsf{X}$ and with values in a low dimensional compact
space, or in a discrete space. This function is sometimes called a reaction coordinate or an
order parameter in the physics literature. Let us introduce $O * \pi$, the image of the measure
$\pi$ by $O$: for any test function $\varphi$, $\mathbb{E}_\pi(\varphi \circ O(X)) = \mathbb{E}_{O * \pi}(\varphi(X))$ (where by definition, for any
probability measure $\nu$ and any test function $h$, $\mathbb{E}_\nu(h(X)) = \int h \, d\nu$). The free energy biased
probability measure $\pi_\star$ is defined by the following two properties: (i) the conditional measures
$\pi_\star(dx)$ given the value of $O(x)$ are the same as the conditional measures $\pi(dx)$ given the value
of $O(x)$, and (ii) the image of $\pi_\star$ by $O$ is the uniform measure.

Let us give two prototypical examples. When $O = \xi$ is a smooth function with values in
a continuous space, for example $\xi : \mathsf{X} \to \mathbb{T}$ (where $\mathbb{T} = \mathbb{R}/\mathbb{Z}$ is the one-dimensional torus), we
have

    $\pi_\star(dx) = (1/\rho) \circ \xi(x) \, \pi(dx)$    (1)

where $\rho : \mathbb{T} \to \mathbb{R}$ is the density of the measure $\xi * \pi$ with respect to the Lebesgue measure
on $\mathbb{T}$. In this case, $A(z) = -\ln \rho(z)$ can be interpreted as a free energy [21], hence the name
"free energy biasing techniques". When $O = I$ is a function with values in a discrete finite set
(this will be the case considered in this paper), $I : \mathsf{X} \to \{1, \ldots, d\}$, we have

    $\pi_\star(dx) = \frac{1}{d} \sum_{i=1}^d \frac{\mathbf{1}_{\{I(x) = i\}}}{\theta_\star(i)} \, \pi(dx)$    (2)

where $\theta_\star(i) = \pi(\{x \in \mathsf{X}, I(x) = i\})$ is the law of the measure $I * \pi$ on $\{1, \ldots, d\}$.
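As a small numerical illustration of the discrete case, the following Python sketch builds the biased measure $\pi_\star(x) = \pi(x) / (d \, \theta_\star(I(x)))$ on a finite state space, checks that every stratum receives mass $1/d$, and checks that importance reweighting recovers an average under $\pi$. The six-state target `pi` and the map `strata` are hypothetical toy choices (not taken from the paper), and strata are indexed from 0.

```python
from fractions import Fraction

# Hypothetical toy target pi on 6 states (unnormalized weights 1, 1, 20, 20, 4, 2).
raw = [1, 1, 20, 20, 4, 2]
total = sum(raw)
pi = [Fraction(w, total) for w in raw]

# Hypothetical order parameter I: state index -> stratum in {0, ..., d-1}.
strata = [0, 0, 1, 1, 2, 2]
d = 3

# theta_star(i) = pi({x : I(x) = i}), the image of pi by I.
theta_star = [sum(p for p, s in zip(pi, strata) if s == i) for i in range(d)]

# Biased measure: pi_star(x) = pi(x) / (d * theta_star(I(x))).
pi_star = [p / (d * theta_star[s]) for p, s in zip(pi, strata)]

# Mass of each stratum under pi_star (exactly 1/d by construction).
strata_mass = [sum(p for p, s in zip(pi_star, strata) if s == i) for i in range(d)]


def f(x):
    return x * x  # arbitrary test function on the states 0..5


# Importance sampling identity: E_pi[f] = E_pi_star[f * d * theta_star(I)].
mean_pi = sum(p * f(x) for x, p in enumerate(pi))
mean_reweighted = sum(
    p * f(x) * d * theta_star[strata[x]] for x, p in enumerate(pi_star)
)
print(strata_mass, mean_pi == mean_reweighted)
```

Exact rational arithmetic is used so that the two identities hold without rounding error.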

The bottom line of free energy biasing techniques is that it should be easier to sample $\pi_\star$
than to sample $\pi$ since, by construction, $O * \pi_\star$ is the uniform probability measure. Then,
samples from $\pi$ can be obtained by importance sampling from $\pi_\star$. Whether $\pi_\star$ is
indeed much easier to sample than $\pi$ depends on the choice of $O$. Designing a good
choice for $O$ in practice is not an easy task, and we do not discuss these aspects further
here. This is related to the choice of a "good" reaction coordinate in the physics
literature, which is a much debated subject. We refer for example to [7] for such an analysis in
the context of free energy biasing techniques used to sample posterior distributions in Bayesian
statistics.

Of course, the difficulty is that in general $O * \pi$ is unknown (equivalently, $\rho$ in (1) or
$\theta_\star$ in (2) is unknown), so that it is not possible to sample from $\pi_\star$. The idea is then to
approximate $O * \pi$ on the fly in order to sample from $\pi_\star$ in the longtime limit. This is the
adaptive feature of these algorithms: the importance sampling factor is computed as time
goes on, in order to penalize states (namely level sets of $O$) which have already been visited. To
approximate $\pi_\star$ at a given time, one could either use the occupation measure of the Markov
chain up to the current time (this is typically what is done in practice in the molecular
dynamics community) or use an approximation over many Markov chains running
in parallel [19, 27]. Moreover, one could either think of approximating $O * \pi$ (these are the
so-called Adaptive Biasing Potential (ABP) techniques) or, in the case where $O$ is a continuous
order parameter, approximating $A'(z)$ (these are the so-called Adaptive Biasing Force (ABF)
techniques [HJ Q3]). This gives rise to many algorithms in the literature [19], which are more
or less efficient and more or less difficult to analyze mathematically. We refer for example
to [33] for ABF techniques using the occupation measure, to [27], [Ttj] for ABF techniques
using many replicas in parallel, to [30] for ABP approaches using the occupation measure and
to [3], [5] for ABP approaches using many replicas in parallel. Before discussing the efficiency
and the mathematical analysis of these algorithms, let us emphasize that in many applications,
computing the measure $O * \pi$ (or equivalently the free energy) is actually the main goal [21].

Roughly speaking, from a practical viewpoint, most ABP approaches (like the Wang-Landau
algorithm) are more involved to use since they typically require introducing a vanishing
adaption mechanism. Indeed, even if one starts with a very good approximation of $O * \pi$,
and thus with a probability measure very close to $\pi_\star$, the adaptive mechanism will introduce
a non-zero biasing factor to penalize visited level sets of $O$ as time goes on. One crucial feature
of ABP approaches is thus to penalize the visited states less and less as time goes on, so that
in the longtime limit, no adaption is performed anymore. The way this adaption mechanism
is performed is made precise below in the Wang-Landau case. We would also like to mention
that some ABP techniques without externally imposed vanishing adaption have been
proposed, like the self-healing umbrella sampling [24, 9], but we do not discuss them since they
are not related to Wang-Landau algorithms. ABF approaches do not require such a vanishing
adaption mechanism since the approximation of $A'(z)$ is based on conditional measures given
the value of $O$, which are not affected by the biasing factor (since it only depends on $O$).
However, ABF techniques cannot be used for discrete order parameters.

In terms of mathematical analysis, approximations based on many replicas in parallel are 
typically easier to analyze, since they can be related (in the limit of infinitely many replicas) 
to mean field models for which powerful longtime convergence analysis techniques can be 
used. We refer for example to [20, 18] for such an analysis for an ABF technique. In [20]
for example, it is shown that the method is efficient if the family of conditional probability
measures $\pi(dx)$ given $O(x)$ has good mixing properties (namely large Logarithmic Sobolev
Inequality constants). The convergence analysis and, more importantly, the study of the 
efficiency of free energy biased techniques for approximations based on the occupation measure 
are much more involved since correlations in time of the Markov process play a crucial role. 
The aim of this paper is to propose such an analysis for the Wang-Landau algorithm, which 
is an ABP approach. 

1.2 Objectives and main results 

The Wang-Landau algorithm both computes a (penalty) sequence $\{\theta_n, n \ge 0\}$ approximating
(in the longtime limit) the measure $O * \pi$ and samples draws $\{X_n, n \ge 0\}$ distributed (in
the longtime limit) according to $\pi_\star$. The update of the penalty sequence follows a Stochastic
Approximation algorithm [291 H] and is of the form



Different strategies for the field $\mathcal{H}_n$ and the adaption schedule $\{\gamma_n, n \ge 1\}$ have been
proposed in the literature. In the original paper [30], the authors came up with a stochastic
adaption schedule, hereafter called flat histogram Wang-Landau. In this procedure, the
updating parameter $\gamma$ remains constant up to the (random) time when the sampling along the
chosen order parameter $O$ is approximately uniform, the "amount of uniformity" being
measured according to the current value of $\gamma$. Then $\gamma$ is lowered and a new updating procedure
of the weights starts with a constant $\gamma$. Another strategy consists in a deterministic update
of the adaption sequence $\{\gamma_n, n \ge 1\}$.

Although the Wang-Landau algorithm has been successfully applied, there are many open
questions about its longtime behavior and its efficiency. The study of its longtime behavior
relies on the convergence of stochastic approximation algorithms with Markovian inputs [H [1]
combined with the convergence of adaptive Markov chain Monte Carlo samplers; for both
parts, the stability of the sequence $\{\theta_n, n \ge 0\}$ is a fundamental property. Stability here
means that the sequence $\{\theta_n, n \ge 0\}$ remains in a compact subset of probability measures
with support equal to the support of $O * \pi$ (as explained in Section 3, this is related to a
recurrence property).

The asymptotic behavior of the flat histogram Wang-Landau algorithm, when $\mathcal{H}_n$ is such
that, in some sense, $\theta_n$ counts the number of visits to the level sets of $O$, has been considered
in [2, 15]. One crucial step is to show that the time $\tau$ to reach the flat histogram criterion
is finite with probability one. In [2], it is proved for a specific field $\mathcal{H}_n$ that $\tau$ is finite
almost surely, and that the sequence $\{\theta_n, n \ge 0\}$ is stable and converges almost surely. A strong
law of large numbers for the draws $\{X_n, n \ge 0\}$ is also established for a wide family of unbounded
functions. In [15], the authors show that the precise form of $\mathcal{H}_n$ plays a role in the convergence
of the flat histogram Wang-Landau algorithm (see Section 2.2 for more details).

In this paper, we consider the Wang-Landau algorithm with a deterministic adaption
sequence $\{\gamma_n, n \ge 1\}$ (see again Section 2.2 for a precise definition of the algorithm). The
aim of this article is twofold. First, we address both the convergence of $\{\theta_n, n \ge 0\}$ and the
convergence of $\{X_n, n \ge 0\}$ to $\pi_\star$. More precisely, we prove that the sequence $\{\theta_n, n \ge 0\}$
is stable; we also prove its almost-sure convergence as well as a Central Limit Theorem.
We then prove the ergodicity and a strong law of large numbers for the draws $\{X_n, n \ge 0\}$.
Second, we discuss the efficiency of the Wang-Landau procedure from a mathematical
viewpoint. We believe that the convergence results are a necessary first step in
the study of the Wang-Landau algorithm, but are by no means the end of the story. Indeed,
the real practical interest of adaptive techniques lies in their improved convergence properties.
Although this improvement is obvious to practitioners, it is mathematically more difficult to
formalize.

Concerning the convergence result, we would like to mention the previous work [23], where
some results about the longtime analysis of Wang-Landau with deterministic adaption can
be found. In that paper, the authors combine the Wang-Landau algorithm with a reprojection
technique so that the sequence $\{\theta_n, n \ge 0\}$ is stable by definition; then, they prove the
convergence of the sequence whenever the limiting point is in the interior of the reprojection
space. Therefore, our results extend the work [23] by precisely analyzing the stability of
the algorithm, by addressing the convergence of $\{\theta_n, n \ge 0\}$ under weaker assumptions and
by providing additional asymptotic results.

Our second contribution is about the efficiency of the algorithm. To our knowledge, the
previous mathematical studies on the Wang-Landau algorithm solely focused on the convergence
of the algorithm, not on its efficiency. As mentioned above, such insight into the convergence
properties has been obtained for adaptive methods of ABF type using approximations based
on many replicas in parallel, see [20, 18]. We propose here two viewpoints. First, we show a
Central Limit Theorem on the sequence $\{\theta_n, n \ge 0\}$, which provides the convergence rate of
the algorithm. Second, we show through the analytical study of a toy model, confirmed by
numerical results in a more complicated case, that the Wang-Landau algorithm
indeed makes it possible to efficiently escape from metastable states.






The paper is organized as follows. We describe in Section 2 the algorithm we consider
and compare it to previously proposed Wang-Landau type algorithms. We then study its
convergence in Section 3. We first prove in Section 3.2 a fundamental stability result. Then
we deduce convergence properties relying on previous results on stochastic approximation with
Markovian inputs and on the theory of adaptive Markov chain Monte Carlo samplers. We
next turn to an important discussion on the efficiency of the method in Section 4, where we
try to quantify mathematically the improvement on the convergence properties given by the
Wang-Landau dynamics. The proofs of the results presented in Sections 3 and 4 are gathered
in Section 5.

2 Description of the Wang-Landau algorithm 
2.1 Notation and preliminaries 

The system we consider is described by a normalized target probability density $\pi$ defined
on a Polish space $\mathsf{X}$, endowed with a reference measure $\lambda$ defined on the Borel $\sigma$-algebra
$\mathcal{X}$. Notice that, as for the classical Metropolis-Hastings procedure, the practical implementation
of the algorithm only requires specifying $\pi$ up to a multiplicative constant. In statistical
physics, $\mathsf{X}$ typically is the set of all admissible configurations of the system while $\pi$ is a Gibbs
measure with density $\pi(x) = Z^{-1} \exp(-\beta U(x))$, $U$ being the potential energy function and
$\beta$ the inverse temperature. In condensed matter physics for instance, actual simulations are
performed on systems composed of $N$ particles in dimension 2 or 3, living in a cubic box with
periodic boundary conditions. In this case, $\mathsf{X} = (L\mathbb{T})^{2N}$ or $\mathsf{X} = (L\mathbb{T})^{3N}$, where $L$ is the length
of the sides of the box and $\mathbb{T} = \mathbb{R}/\mathbb{Z}$ is the one-dimensional torus.

Consider now a partition $\mathsf{X}_1, \ldots, \mathsf{X}_d$ of $\mathsf{X}$ in $d \ge 2$ elements, and define, for any
$i \in \{1, \ldots, d\}$,

    $\theta_\star(i) \overset{\mathrm{def}}{=} \int_{\mathsf{X}_i} \pi(x) \, \lambda(dx)$.    (3)

In the following, $\mathsf{X}_i$ will be called the $i$-th stratum. Each weight $\theta_\star(i)$, which is assumed to be
positive, gives the relative likelihood of the stratum $\mathsf{X}_i \subset \mathsf{X}$. In practice, the partitioning could
be obtained by considering some smooth function $\xi : \mathsf{X} \to [a, b]$ (called a reaction coordinate
in the physics literature) and defining, for $i = 1, \ldots, d - 1$,

    $\mathsf{X}_i = \xi^{-1}([a_{i-1}, a_i))$,    (4)

and $\mathsf{X}_d = \xi^{-1}([a_{d-1}, a_d])$, with $a = a_0 < a_1 < \ldots < a_d = b$ (possibly, $a = -\infty$ and/or
$b = +\infty$). In the notation of the introduction, the order parameter is thus the discrete
function $I : \mathsf{X} \to \{1, \ldots, d\}$ defined by

    $\forall x \in \mathsf{X}, \quad I(x) = i$ if and only if $x \in \mathsf{X}_i$.    (5)

As mentioned above, the choice of an appropriate function $I$ is a difficult issue, and is mostly
based on intuition for the time being: practitioners identify some slowly evolving degree of
freedom responsible for the metastable behavior of the system (the fact that trajectories 
generated by the numerical method remain trapped for a long time in some region of the 
phase space, and only occasionally hop to another region, where they also remain trapped). 
There are however ways to quantify the relevance of the choice of the reaction coordinate, see
for instance the discussion in [7].

The above discussion motivates the fact that the weights typically span several orders
of magnitude, some sets $\mathsf{X}_i$ having very large weights, and other ones being very unlikely
under $\pi$. Besides, trajectories bridging two very likely states may need to go through unlikely
regions. To efficiently explore the configuration space, and sample numerous configurations in
all the strata $\mathsf{X}_i$, it is therefore a natural idea to resort to importance sampling strategies and
reweight appropriately each subset $\mathsf{X}_i$. A possible way to do so is the following. Let $\Theta$ be the
subset of (non-degenerate) probability measures on $\{1, \ldots, d\}$ given by

    $\Theta = \Big\{ \theta = (\theta(1), \ldots, \theta(d)) : 0 < \theta(i) < 1 \text{ for all } i \in \{1, \ldots, d\} \text{ and } \sum_{i=1}^d \theta(i) = 1 \Big\}$.



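For any $\theta \in \Theta$, we define the probability density $\pi_\theta$ on $(\mathsf{X}, \mathcal{X})$ (endowed with the reference
measure $\lambda$) as

    $\pi_\theta(x) = \Big( \sum_{i=1}^d \frac{\theta_\star(i)}{\theta(i)} \Big)^{-1} \sum_{i=1}^d \frac{\pi(x)}{\theta(i)} \mathbf{1}_{\mathsf{X}_i}(x)$.    (6)

This measure is such that the weight of the set $\mathsf{X}_i$ under $\pi_\theta$ is proportional to $\theta_\star(i)/\theta(i)$. In
particular, all the strata $\mathsf{X}_i$ have the same weight under $\pi_{\theta_\star}$. Unfortunately, $\theta_\star$ is unknown
and sampling under $\pi_{\theta_\star}$ is typically unfeasible.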
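The effect of the weights on the strata can be checked on a toy example. The sketch below computes the mass of each stratum under the biased density, taken to be proportional to $\theta_\star(i)/\theta(i)$, for two choices of $\theta$. The numerical values of `theta_star` are hypothetical, and the helper name `strata_weights` is introduced here for illustration only.

```python
from fractions import Fraction


def strata_weights(theta_star, theta):
    """Mass of each stratum under pi_theta: proportional to theta_star(i) / theta(i)."""
    ratios = [ts / t for ts, t in zip(theta_star, theta)]
    norm = sum(ratios)
    return [r / norm for r in ratios]


theta_star = [Fraction(1, 24), Fraction(5, 6), Fraction(1, 8)]  # hypothetical values
uniform = [Fraction(1, 3)] * 3

flat = strata_weights(theta_star, theta_star)   # ideal bias: theta = theta_star
unbiased = strata_weights(theta_star, uniform)  # uniform theta: no bias at all
print(flat, unbiased)
```

With $\theta = \theta_\star$ all strata carry the same mass, whereas a uniform $\theta$ leaves the strata masses equal to $\theta_\star$, that is, the target is not biased at all.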

The Wang-Landau algorithm precisely is a way to overcome these difficulties: at each
iteration of the algorithm, a weight vector $\theta_n = (\theta_n(1), \ldots, \theta_n(d))$ is updated based on the
past behavior of the algorithm and a point is drawn from a Markov kernel $P_{\theta_n}$ with invariant
density $\pi_{\theta_n}$. The intuition for the convergence of this algorithm is that if $\{\theta_n, n \ge 0\}$ converges
to $\theta_\star$, then the draws are asymptotically distributed according to the density $\pi_{\theta_\star}$. Conversely,
if the draws are distributed according to $\pi_{\theta_\star}$, then the update of $\{\theta_n, n \ge 0\}$ is chosen such that
it converges to $\theta_\star$. We will derive below sufficient conditions on the sequence $\{\gamma_n, n \ge 1\}$ of
step-sizes used to update $\{\theta_n, n \ge 0\}$ and on the Markov kernels $\{P_\theta, \theta \in \Theta\}$ in order to prove
the convergence of a version of the Wang-Landau algorithm, namely a linearized Wang-Landau
algorithm with a deterministic adaption: the step-size $\gamma_n$ is used at the $n$-th iteration of the
Markov chain.



2.2 The linearized Wang-Landau algorithm with deterministic adaption 

We now describe the algorithm we study in this article. Let $\{\gamma_n, n \ge 1\}$ be a $[0, 1)$-valued
deterministic sequence. For any $\theta \in \Theta$, denote by $P_\theta$ a Markov transition kernel on $(\mathsf{X}, \mathcal{X})$
with unique stationary distribution $\pi_\theta(x) \lambda(dx)$; for example, $P_\theta$ is one step of a
Metropolis-Hastings algorithm [25, 13] with target probability measure $\pi_\theta(x) \lambda(dx)$.

Consider an initial value $X_0 \in \mathsf{X}$ and an initial set of weights $\theta_0 \in \Theta$ (typically, in the absence
of any prior information, $\theta_0(i) = 1/d$). Define the process $\{(X_n, \theta_n), n \ge 0\}$ as follows: given
the current value $(X_n, \theta_n)$,

(1) Draw $X_{n+1}$ under the conditional distribution $P_{\theta_n}(X_n, \cdot)$;

(2) Set $i = I(X_{n+1})$ where $I$ is given by (5). The weights are then updated as

    $\theta_{n+1}(i) = \theta_n(i) + \gamma_{n+1} \theta_n(i) (1 - \theta_n(i))$,
    $\theta_{n+1}(k) = \theta_n(k) - \gamma_{n+1} \theta_n(k) \theta_n(i)$ for $k \neq i$.    (7)
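Steps (1) and (2) above can be sketched as follows on a finite state space. This is a minimal illustration under assumptions that are not in the text: a toy target `pi`, a hypothetical strata map, a uniform (hence symmetric) proposal on the whole state space, and the schedule $\gamma_n = 0.5 / n^{0.7}$. Strata are indexed from 0, and the function name `wang_landau` is ours.

```python
import random


def wang_landau(pi, strata, d, n_iter, gamma, seed=0):
    """Linearized Wang-Landau with deterministic step sizes on a finite state space.

    pi     : unnormalized target weights, one per state
    strata : strata[x] in {0, ..., d-1} is the stratum index I(x)
    gamma  : function n -> step size gamma_n in [0, 1)
    """
    rng = random.Random(seed)
    n_states = len(pi)
    theta = [1.0 / d] * d  # theta_0 uniform: no prior information
    x = 0                  # arbitrary initial state X_0
    for n in range(1, n_iter + 1):
        # Step (1): one Metropolis-Hastings move targeting pi_theta, whose
        # density is proportional to pi(x) / theta(I(x)). The proposal is
        # uniform on the state space, hence symmetric and bounded below.
        y = int(rng.random() * n_states)
        ratio = (pi[y] / theta[strata[y]]) / (pi[x] / theta[strata[x]])
        if rng.random() < min(1.0, ratio):
            x = y
        # Step (2): linearized update of the weights, visited stratum i.
        i = strata[x]
        g = gamma(n)
        theta_i = theta[i]
        for k in range(d):
            if k == i:
                theta[k] = theta_i + g * theta_i * (1.0 - theta_i)
            else:
                theta[k] = theta[k] - g * theta[k] * theta_i
    return theta


# Hypothetical toy problem: 6 states grouped into 3 strata.
pi = [1.0, 1.0, 20.0, 20.0, 4.0, 2.0]
strata = [0, 0, 1, 1, 2, 2]
theta_star = [2 / 48, 40 / 48, 6 / 48]
theta = wang_landau(pi, strata, 3, 100_000, lambda n: 0.5 / n ** 0.7)
print(theta, theta_star)
```

On this toy problem the returned weights are expected to lie close to $\theta_\star = (1/24, 5/6, 1/8)$; the uniform proposal is chosen only for simplicity and does not reproduce a metastable situation.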

Note that since $\gamma_n \in [0, 1)$, $\theta_n \in \Theta$ for any $n \ge 0$. As explained in the introduction, the
idea of the updating strategy (7) is that the weights of the visited strata are increased, in
order to penalize already visited states. The update of the probability vector $\theta_n$ can be recast
equivalently into the stochastic approximation framework upon writing

    $\theta_{n+1} = \theta_n + \gamma_{n+1} H(X_{n+1}, \theta_n)$,    (8)

where $H : \mathsf{X} \times \Theta \to [-1, 1]^d$ is defined componentwise by

    $H_i(x, \theta) = \theta(i) \big( \mathbf{1}_{\mathsf{X}_i}(x) - \theta(I(x)) \big)$,    (9)

where the function $I$ is given by (5).

The updating strategy (7) (or equivalently (8)) is a modification of the original Wang-Landau
algorithm obtained by (i) using a deterministic schedule for the evolution of the
step-sizes used to modify the values of the weights (instead of reducing the value of these
step-sizes at random times when the empirical frequency of visits to each
stratum is sufficiently uniform: this is the flat histogram version of the Wang-Landau algorithm
mentioned in the introduction) and (ii) linearizing at first order in $\gamma_n$ the update of the weight
$\theta_n$.

Concerning this second point, the standard Wang-Landau update is

    $\theta_{n+1}(i) = \theta_n(i) \, \frac{1 + \gamma_{n+1} \mathbf{1}_{\mathsf{X}_i}(X_{n+1})}{1 + \gamma_{n+1} \theta_n(I(X_{n+1}))}$.    (10)

The update (7) is obtained from (10) in the limit of small $\gamma_n$. For the stability and the
convergence analysis in Section 3, we adopt this linear update. The main advantage is that it makes
the proof of the stability simpler; nevertheless, since $\gamma_n$ converges to zero, the convergence
results proved in Sections 3.3 and 3.4 remain valid for the update (10) and could be proved along
the same lines (these details are omitted in this paper). For the analysis of the efficiency of the
algorithm in Section 4, we will use the standard updating rule (10).
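The first-order agreement between the two updates can be checked numerically. The sketch below implements the linearized update and the standard update in the form "multiply the weight of the visited stratum by $1 + \gamma$, then renormalize", which is our reading of the standard Wang-Landau rule; the weight vector and the step sizes are hypothetical, and strata are indexed from 0.

```python
def linear_update(theta, i, g):
    """Linearized update: visited stratum i, step size g."""
    return [
        t + g * t * ((1.0 if k == i else 0.0) - theta[i])
        for k, t in enumerate(theta)
    ]


def standard_update(theta, i, g):
    """Standard update: multiply theta(i) by 1 + g, then renormalize."""
    return [
        t * (1.0 + (g if k == i else 0.0)) / (1.0 + g * theta[i])
        for k, t in enumerate(theta)
    ]


theta = [0.2, 0.5, 0.3]  # hypothetical current weights
gaps = {}
for g in (1e-1, 1e-2, 1e-3):
    lin = linear_update(theta, 1, g)
    std = standard_update(theta, 1, g)
    gaps[g] = max(abs(a - b) for a, b in zip(lin, std))
    print(g, gaps[g])  # the gap shrinks like g**2
```

Both updates keep the weights summing to one, and their difference is of second order in the step size.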

We would like to emphasize here that this distinction between the two updating strategies
(7) and (10) does matter when considering the flat histogram criterion for the vanishing
adaption procedure, as proved in [15]. Indeed, it is shown in [15] that the linearized version of
the update (7) satisfies in finite time the uniformity criterion required in the original
Wang-Landau algorithm, whereas this is not guaranteed for the nonlinear update (10).



3 Convergence of the Wang-Landau algorithm 

The proof of the convergence of the Wang-Landau algorithm described in Section 2.2 relies on
its reformulation (8) as a stochastic approximation procedure. Since the draws $\{X_n, n \ge 1\}$
satisfy for any measurable non-negative function $f$:

    $\mathbb{E}[f(X_{n+1}) \mid \mathcal{F}_n] = P_{\theta_n} f(X_n)$,    (11)

where $\mathcal{F}_n$ denotes the $\sigma$-field $\sigma(\theta_0, X_0, X_1, \ldots, X_n)$, it is a so-called "stochastic approximation
with Markovian dynamics" (see e.g. [4]).

The main difficulty, when proving the almost-sure convergence of such algorithms, is the
stability, i.e. how to ensure that the sequence $\{\theta_n, n \ge 0\}$ remains in a compact subset of $\Theta$.
We use a traditional approach to answer this question: we first prove that our algorithm
satisfies a recurrence property, i.e. the sequence $\{\theta_n, n \ge 0\}$ visits infinitely often a compact
subset of $\Theta$; we then show that there exists a Lyapunov function with respect to the mean-field
function $h : \Theta \to [-1, 1]^d$,

    $h(\theta) = \int_{\mathsf{X}} H(x, \theta) \, \pi_\theta(x) \, \lambda(dx)$,    (12)

with strong enough properties so that the recurrence property implies stability. Different
strategies based on truncations are proposed in the literature to circumvent the stability
problem (see e.g. [IT]). The most popular technique is the truncation to a fixed compact set,
but this is not a satisfactory solution since the choice of this compact set is delicate: a necessary
condition for convergence is that the compact set contains the unknown desired limit. An adaptive
truncation has been proposed in [6], which avoids the main drawbacks of the deterministic
truncation approach. We prove in Section 3.2 that, under conditions on the target density $\pi$
and the step-size sequence $\{\gamma_n, n \ge 1\}$, the algorithm (8) is recurrent, so that such truncation
techniques are not required.

In Section 3.3, we address the almost-sure convergence of the weight sequence $\{\theta_n, n \ge 0\}$.
We then obtain in Section 3.4 the convergence in distribution and a strong law of large
numbers for the samples $\{X_k, k \ge 0\}$. Finally, we obtain a central limit theorem in Section 3.5
for the weight sequence $\{\theta_n, n \ge 0\}$.

3.1 Assumptions on the Metropolis dynamics and on the adaption rate 

Our conditions fall into three categories: conditions on the equilibrium measure (see A1), on
the transition kernels $\{P_\theta, \theta \in \Theta\}$ (see A2) and conditions on the step-size sequence
$\{\gamma_n, n \ge 1\}$ (see A3). It is assumed that

A1 The probability density $\pi$ with respect to the measure $\lambda$ is such that $0 < \inf_{\mathsf{X}} \pi \le
\sup_{\mathsf{X}} \pi < \infty$. In addition, $\inf_{1 \le i \le d} \theta_\star(i) > 0$, where $\theta_\star$ is given by (3).

The first part of Assumption A1 is satisfied, for example, for smooth positive densities on
a compact state space $\mathsf{X} \subset \mathbb{R}^D$ with the Lebesgue measure as the reference measure $\lambda$, or for a
positive probability measure on a discrete finite state space $\mathsf{X} = \{1, \ldots, K\}$ with the uniform
measure as the reference measure. Since $\inf_{\mathsf{X}} \pi$ is assumed to be positive, the second part of
the assumption is satisfied as soon as $\inf_{1 \le i \le d} \lambda(\mathsf{X}_i) > 0$. The minorization condition on $\pi$
certainly is the most restrictive assumption: it is introduced in order to prove the recurrence
of the algorithm (8). This condition can be removed by adding a stabilization step to (8) (such
as a truncation technique at random varying bounds |U[T7]) in order to ensure the recurrence.

The second assumption is: 

A2 For any $\theta \in \Theta$, $P_\theta$ is a Metropolis-Hastings transition kernel with invariant
distribution $\pi_\theta \, d\lambda$, where $\pi_\theta$ is given by (6), and with symmetric proposal kernel $q(x, y) \lambda(dy)$
satisfying $\inf_{\mathsf{X}^2} q > 0$.

The transition probability for a symmetric Metropolis-Hastings dynamics reads

    $P_\theta(x, dy) = q(x, y) \alpha_\theta(x, y) \, \lambda(dy) + \delta_x(dy) \Big( 1 - \int_{\mathsf{X}} q(x, z) \alpha_\theta(x, z) \, \lambda(dz) \Big)$

with

    $\alpha_\theta(x, y) = 1 \wedge \frac{\pi_\theta(y) q(y, x)}{\pi_\theta(x) q(x, y)} = 1 \wedge \frac{\pi_\theta(y)}{\pi_\theta(x)}$,

the last equality being a consequence of the symmetry of $q$. Assumption A2 is satisfied for
instance when $\mathsf{X} = \mathbb{T}^n$ (a cubic simulation cell endowed with periodic boundary conditions),
and $q(x, y) = q(y - x)$ for a positive density $q$ such that $\inf_{\mathsf{X}} q > 0$.
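For intuition, the kernel above can be written as a transition matrix on a finite state space, with the counting measure playing the role of the reference measure. The sketch below builds this matrix for a symmetric proposal and checks that the (unnormalized) target is invariant. The target values, the weights `theta` and the helper name `mh_kernel` are hypothetical, and exact rational arithmetic is used so that the invariance check is exact.

```python
from fractions import Fraction


def mh_kernel(target, q):
    """Metropolis-Hastings matrix: P(x, y) = q(x, y) * alpha(x, y) for y != x,
    with all remaining mass (rejections and proposals y = x) on the diagonal.

    target : unnormalized positive weights (plays the role of pi_theta)
    q      : symmetric proposal matrix with rows summing to one
    """
    n = len(target)
    P = [[Fraction(0)] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                alpha = min(Fraction(1), target[y] / target[x])
                P[x][y] = q[x][y] * alpha
        P[x][x] = 1 - sum(P[x])  # diagonal entry is still 0 when summed
    return P


# Hypothetical biased target pi(x) / theta(I(x)) on 6 states and 3 strata.
pi = [Fraction(w) for w in (1, 1, 20, 20, 4, 2)]
strata = [0, 0, 1, 1, 2, 2]
theta = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
target = [p / theta[s] for p, s in zip(pi, strata)]

n = len(target)
q = [[Fraction(1, n)] * n for _ in range(n)]  # uniform, hence symmetric
P = mh_kernel(target, q)

# Invariance: sum_x target(x) P(x, y) = target(y) for every y.
flow = [sum(target[x] * P[x][y] for x in range(n)) for y in range(n)]
print(flow == target)
```

The invariance follows from detailed balance, since `target[x] * P[x][y]` equals `q[x][y] * min(target[x], target[y])`, which is symmetric in `x` and `y`.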

The minorization condition on $q$ implies that the transition kernels $\{P_\theta, \theta \in \Theta\}$ are
uniformly (geometrically) ergodic, as stated in Proposition 3.1 below. This property allows a
simple presentation of the main ingredients for the limiting behavior analysis of the
algorithm. Extensions to a more general case could be done by using the same tools as in
(see also [U Section 3]) and controlling the dependence upon $\theta$ of the ergodic behavior. These
technical steps are out of the scope of this paper.

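We prove in Section 5.1 the result:

Proposition 3.1. Under A1 and A2, there exists $\rho \in (0, 1)$ such that for all $\theta \in \Theta$, for all
$x \in \mathsf{X}$ and for all $A \in \mathcal{X}$, it holds:

    $P_\theta(x, A) \ge \rho \int_A \pi_\theta(x) \, \lambda(dx)$,    (13)

    $\sup_{\theta \in \Theta} \sup_{x \in \mathsf{X}} \| P_\theta^n(x, \cdot) - \pi_\theta \, d\lambda \|_{\mathrm{TV}} \le 2 (1 - \rho)^n$,    (14)

where for a signed measure $\mu$, the total variation norm is defined as

    $\| \mu \|_{\mathrm{TV}} = \sup_{\{f : \sup_{\mathsf{X}} |f| \le 1\}} |\mu(f)|$.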
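The link between a minorization of the kernel by its invariant measure and a geometric bound in total variation can be illustrated on a small example. The sketch below takes a hypothetical 3-state kernel `P` (not related to any kernel of the paper), computes its invariant law by iteration, takes the largest `rho` such that `P[x][y] >= rho * pi[y]` for all states, and compares the total variation norm with $2(1-\rho)^n$. On a finite space, the norm defined above is the $\ell^1$ distance.

```python
def step(mu, P):
    """One step of the chain on the law mu: returns the row vector mu P."""
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]


def tv_norm(mu, nu):
    """sup over |f| <= 1 of |mu(f) - nu(f)|, i.e. the l1 distance."""
    return sum(abs(a - b) for a, b in zip(mu, nu))


# Hypothetical 3-state kernel with positive entries (rows sum to one).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
n_states = len(P)

# Invariant law, obtained by iterating the kernel from the uniform law.
pi = [1.0 / n_states] * n_states
for _ in range(200):
    pi = step(pi, P)

# Largest rho such that P(x, y) >= rho * pi(y) for all x, y.
rho = min(P[x][y] / pi[y] for x in range(n_states) for y in range(n_states))


def worst_tv(n):
    """Worst-case (over starting states) norm of P^n(x, .) - pi."""
    worst = 0.0
    for x in range(n_states):
        mu = [1.0 if y == x else 0.0 for y in range(n_states)]
        for _ in range(n):
            mu = step(mu, P)
        worst = max(worst, tv_norm(mu, pi))
    return worst


for n in (1, 2, 5, 10):
    print(n, worst_tv(n), 2 * (1 - rho) ** n)
```

The bound is typically not sharp: here the actual decay is governed by the second largest eigenvalue modulus of `P`, which is smaller than `1 - rho`.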

We finally introduce conditions on the magnitude of the step-size sequence.

A3 The sequence $\{\gamma_n, n \ge 1\}$ is a $[0, 1)$-valued deterministic sequence such that

a) $\{\gamma_n, n \ge 1\}$ is an (ultimately) non-increasing sequence and $\lim_n \gamma_n = 0$;

b) $\sum_n \gamma_n = \infty$;

c) $\sum_n \gamma_n^2 < \infty$.

Examples of step-size sequences satisfying Assumption A3 are the polynomial schedules
$\gamma_n = \gamma_\star / n^\alpha$ with $1/2 < \alpha \le 1$. As already observed in Section 2.2, the condition $\gamma_n \in [0, 1)$
implies that if $\theta_0 \in \Theta$, then for any $n \ge 1$, $\theta_n \in \Theta$. Assumption A3a) is introduced for the
proof of the recurrence property. Assumptions A3b) and A3c) are standard conditions for the stability
and the convergence of a stochastic approximation scheme since the pioneering work [29].
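The two summability conditions can be visualized on a polynomial schedule. The sketch below computes partial sums of $\gamma_n$ and $\gamma_n^2$ for $\gamma_n = \gamma_\star / n^\alpha$ with the hypothetical values $\gamma_\star = 0.5$ and $\alpha = 0.75$; it only illustrates the divergence of the first series and the convergence of the second, and proves nothing.

```python
def partial_sums(gamma_star, alpha, n_max):
    """Return (sum of gamma_n, sum of gamma_n**2) for n = 1..n_max,
    with gamma_n = gamma_star / n**alpha."""
    s1 = 0.0
    s2 = 0.0
    for n in range(1, n_max + 1):
        g = gamma_star / n ** alpha
        s1 += g
        s2 += g * g
    return s1, s2


for n_max in (10**3, 10**4, 10**5):
    print(n_max, partial_sums(0.5, 0.75, n_max))
```

The first partial sum keeps growing (like $n^{1/4}$ for this choice of $\alpha$), while the second one stabilizes.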

3.2 Recurrence property of the weight sequence $\{\theta_n, n \ge 0\}$

We prove in this section that, almost surely, there exists a compact subset of $\Theta$ such that $\theta_n$
belongs to this compact subset for infinitely many $n$. For any $n \ge 0$, set

    $\underline{\theta}_n = \min_{1 \le j \le d} \theta_n(j)$.    (15)

We prove in Section 5.2 the following theorem:

Theorem 3.2. Assume A1, A2 and A3. Then, $\mathbb{P}(\limsup_{n \to \infty} \underline{\theta}_n > 0) = 1$.

The proof is based on the following consideration. The value of the smallest weight
increases when the chain goes into the corresponding stratum (see the updating formula (7)).
Under the stated assumptions, we prove that the chain $\{X_n, n \ge 0\}$ returns to the strata of
smallest weights often enough for the smallest weight to remain isolated from 0.

3.3 Convergence of the weight sequence $\{\theta_n, n \ge 0\}$

In this subsection, the almost-sure convergence of the sequence $\{\theta_n, n \ge 0\}$ to $\theta_\star$ is addressed.
We prove in Section 5.3 the following convergence result:

Theorem 3.3. Assume A1, A2 and A3. Then, $\mathbb{P}(\lim_{n \to +\infty} \theta_n = \theta_\star) = 1$.

The proof relies on fJJ which provides sufficient conditions for convergence of stochastic
approximation techniques. The first step consists in rewriting the weight update (8) as

    $\theta_{n+1} = \theta_n + \gamma_{n+1} h(\theta_n) + \gamma_{n+1} \big( H(X_{n+1}, \theta_n) - h(\theta_n) \big)$,    (16)

where $h$ is given by (12). The heuristic idea is that, if the step-size becomes small sufficiently
rapidly, and the Metropolis dynamics converges sufficiently fast to equilibrium for fixed $\theta$ (a
result given by Proposition 3.1), the update of $\theta_n$ is indeed close to an update with the
averaged drift $h(\theta_n)$. However, in order for the updates of the weights to be non-negligible,
the step-sizes should not be too small. The balance between these two opposite effects is
encoded in the conditions A3b) and A3c).

From a technical viewpoint, the proof of the theorem relies on two main tools. The first
one (see Proposition 5.5) is to show that the function $V : \Theta \to \mathbb{R}_+$ given by

    $V(\theta) = - \sum_{i=1}^d \theta_\star(i) \log\Big( \frac{\theta(i)}{\theta_\star(i)} \Big)$    (17)

is a Lyapunov function with respect to the mean-field $h$, namely $\langle \nabla V(\theta), h(\theta) \rangle < 0$ for $\theta \neq \theta_\star$
and $\langle \nabla V(\theta_\star), h(\theta_\star) \rangle = 0$ (here, $\langle \cdot, \cdot \rangle$ denotes the scalar product in $\mathbb{R}^d$). This motivates the fact
that $\{\theta_n, n \ge 0\}$ may converge to $\theta_\star$. The second important result establishes that the
remainder term $\gamma_{n+1} (H(X_{n+1}, \theta_n) - h(\theta_n))$ in (16) vanishes in some sense (see Proposition 5.10).
This step is quite technical and requires regularity in $\theta$ of the transition kernels $P_\theta$ and the
invariant distributions $\pi_\theta$ (see Lemmas 5.6 and 5.7). The conclusion then follows from [U
Theorem 2.3] and Theorem 3.2.
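The Lyapunov property can be explored numerically. The sketch below assumes that $V(\theta) = -\sum_i \theta_\star(i) \log(\theta(i)/\theta_\star(i))$, and that the mean field is the average of $H_i(x, \theta) = \theta(i)(\mathbf{1}_{\mathsf{X}_i}(x) - \theta(I(x)))$ under a biased measure giving mass proportional to $\theta_\star(j)/\theta(j)$ to stratum $j$. Under these assumptions a direct computation gives $h(\theta) = (\sum_j \theta_\star(j)/\theta(j))^{-1} (\theta_\star - \theta)$. The values of `theta_star` and all function names are hypothetical.

```python
import math
import random


def mean_field(theta, theta_star):
    """h(theta): average over strata j (with mass w_j under pi_theta) of
    H_i = theta(i) * (1_{i == j} - theta(j))."""
    d = len(theta)
    ratios = [ts / t for ts, t in zip(theta_star, theta)]
    norm = sum(ratios)
    w = [r / norm for r in ratios]  # mass of each stratum under pi_theta
    return [
        sum(w[j] * theta[i] * ((1.0 if i == j else 0.0) - theta[j]) for j in range(d))
        for i in range(d)
    ]


def lyapunov(theta, theta_star):
    """V(theta) = - sum_i theta_star(i) * log(theta(i) / theta_star(i))."""
    return -sum(ts * math.log(t / ts) for ts, t in zip(theta_star, theta))


def drift(theta, theta_star):
    """<grad V(theta), h(theta)>, with grad V_i = -theta_star(i) / theta(i)."""
    h = mean_field(theta, theta_star)
    return sum(-ts / t * hi for ts, t, hi in zip(theta_star, theta, h))


def random_point(d, rng):
    """Random probability vector with positive entries."""
    xs = [rng.random() + 1e-3 for _ in range(d)]
    total = sum(xs)
    return [x / total for x in xs]


rng = random.Random(0)
theta_star = [1 / 24, 20 / 24, 3 / 24]  # hypothetical values
points = [random_point(3, rng) for _ in range(1000)]
worst = max(drift(p, theta_star) for p in points)
print(worst, drift(theta_star, theta_star))
```

With the closed form of $h$, the scalar product equals $-(\sum_i \theta_\star(i)^2/\theta(i) - 1) / \sum_j (\theta_\star(j)/\theta(j))$, which is non-positive by the Cauchy-Schwarz inequality and vanishes only at $\theta = \theta_\star$.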

3.4 Ergodicity and Law of large numbers for the samples $\{X_k, k \ge 0\}$

In this subsection, we discuss the asymptotic behavior of the chain $\{X_k, k \ge 0\}$. The main
result is the following (see Section 5.4 for the proof).

Theorem 3.4. Assume A1, A2 and A3. Then for any bounded measurable function $f$,

    $\lim_{n \to \infty} \mathbb{E}[f(X_n)] = \int_{\mathsf{X}} f(x) \, \pi_{\theta_\star}(x) \, \lambda(dx)$,    (18)

    $\frac{1}{n} \sum_{k=1}^n f(X_k) \xrightarrow{\text{a.s.}} \int_{\mathsf{X}} f(x) \, \pi_{\theta_\star}(x) \, \lambda(dx)$.    (19)

This theorem shows that the distribution of the sample $X_n$ converges to $\pi_{\theta_\star}(x) \lambda(dx)$,
where, we recall,

    $\pi_{\theta_\star}(x) = \frac{1}{d} \sum_{i=1}^d \frac{\pi(x)}{\theta_\star(i)} \mathbf{1}_{\mathsf{X}_i}(x)$.

Moreover, the empirical mean of the samples $\{f(X_k), k \ge 0\}$ converges to $\int f \, \pi_{\theta_\star} \, d\lambda$. Hence,
although the weights $\theta_n$ evolve in the adaptive algorithm, ergodic averages can be thought of
as averages with fixed weights $\theta_\star$.

In many practical cases, averages with respect to $\pi$ are of interest. In this case, the
Wang-Landau procedure is used as an adaptive importance sampling strategy. In order to
obtain averages according to $\pi$ along a trajectory of the algorithm, some reweighting has to
be considered. A natural strategy is to use some stratified-type weighted sum of the samples
$\{X_k, k \ge 1\}$:

    $\mathcal{I}_n(f) = d \sum_{i=1}^d \theta_n(i) \Big( \frac{1}{n} \sum_{k=1}^n f(X_k) \mathbf{1}_{\mathsf{X}_i}(X_k) \Big)$.



We prove in Section [5.51 the following result: 

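Theorem 3.5. Assume A1, A2 and A3. Then for any bounded measurable function $f$,

$$\lim_{n \to \infty} \mathbb{E}\left[ d \sum_{i=1}^d \theta_n(i)\, f(X_n)\, \mathbf{1}_{\mathsf{X}_i}(X_n) \right] = \int f(x)\, \pi(x)\, \lambda(dx) , \qquad (20)$$

$$I_n(f) \xrightarrow{\text{a.s.}} \int f(x)\, \pi(x)\, \lambda(dx) . \qquad (21)$$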



There are of course many other reweighting strategies. We have discussed one possible 
choice, but we do not claim that the above estimator is the best one. 



3.5 Central limit theorem for the weight sequence 

In this section, we state a Central Limit Theorem on the error $\theta_n - \theta_*$ along any sequence $\{\theta_n, n \geq 0\}$ converging to $\theta_*$; recall that the convergence of $\{\theta_n, n \geq 0\}$ to $\theta_*$ happens with probability 1 in view of Theorem 3.3. We show that the rate of convergence depends upon the step-size sequence $\{\gamma_n, n \geq 1\}$ and discuss an averaging strategy in order to reach the optimal rate of convergence. An additional assumption is required on the sequence $\{\gamma_n, n \geq 1\}$:

A4 $\lim_n \gamma_n \sqrt{n} = 0$, and one of the following conditions holds:

(i) $\log(\gamma_n/\gamma_{n+1}) = o(\gamma_n)$;

(ii) $\log(\gamma_n/\gamma_{n+1}) \sim \gamma_n/\gamma_*$ with $\gamma_* > d/2$.

The latter conditions are satisfied for sequences $\gamma_n = \gamma_*/n^\alpha$, when $\alpha \in (1/2, 1)$ for (i), or when $\alpha = 1$ and $\gamma_* > d/2$ for (ii). Under this additional assumption, the following result holds (see Section 5.6 for the proof).

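Theorem 3.6. Assume that A1, A2, A3 and A4 hold. Then $\{\gamma_n^{-1/2} (\theta_n - \theta_*), n \geq 1\}$ converges in distribution to a centered Gaussian distribution with variance-covariance matrix $\sigma^2 U_*$, where $\sigma^2 = d/2$ in case (i) and $\sigma^2 = \gamma_* d/(2\gamma_* - d)$ in case (ii),

$$U_* = \int \left\{ \hat{H}_{\theta_*}(x)\, \hat{H}_{\theta_*}^T(x) - P_{\theta_*}\hat{H}_{\theta_*}(x)\, \left(P_{\theta_*}\hat{H}_{\theta_*}\right)^T(x) \right\} \pi_{\theta_*}(x)\, \lambda(dx) , \qquad (22)$$

and

$$\hat{H}_{\theta_*}(x) = \sum_{n \geq 0} P_{\theta_*}^n \left( H(\cdot, \theta_*) - h(\theta_*) \right)(x) = \sum_{n \geq 0} P_{\theta_*}^n H(\cdot, \theta_*)(x) .$$

Notice that $\hat{H}_{\theta_*}$ is the Poisson solution associated to the pair $(P_{\theta_*}, H(\cdot, \theta_*))$, namely $\hat{H}_{\theta_*}$ is a solution to: find $g : \mathsf{X} \to \mathbb{R}^d$ such that

$$g - P_{\theta_*} g = H(\cdot, \theta_*) - \int H(x, \theta_*)\, \pi_{\theta_*}(x)\, \lambda(dx) .$$

By Proposition 3.1 and the results of [26, Chapter 17], such a function exists and is unique up to an additive constant.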
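Assumption A4 can be checked numerically for the polynomial schedule. The following is a minimal sketch, assuming step sizes of the form $\gamma_n = \gamma_*/n^\alpha$; the function names `gamma` and `a4_diagnostics` are illustrative and not part of the paper. It evaluates $\gamma_n \sqrt{n}$ (which should tend to 0) and the ratio $\log(\gamma_n/\gamma_{n+1})/\gamma_n$ (which should tend to 0 in case (i) and to $1/\gamma_*$ in case (ii)).

```python
import math


def gamma(n, gamma_star, alpha):
    """Polynomial step size gamma_n = gamma_star / n**alpha (assumed schedule)."""
    return gamma_star / n ** alpha


def a4_diagnostics(n, gamma_star, alpha):
    """Return (gamma_n * sqrt(n), log(gamma_n / gamma_{n+1}) / gamma_n)."""
    g_n = gamma(n, gamma_star, alpha)
    g_next = gamma(n + 1, gamma_star, alpha)
    return g_n * math.sqrt(n), math.log(g_n / g_next) / g_n


# Case (i): alpha in (1/2, 1); both quantities should decrease towards 0.
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print("case (i) ", n, a4_diagnostics(n, gamma_star=1.0, alpha=0.75))

# Case (ii): alpha = 1 and gamma_star > d/2 (here d = 3, gamma_star = 2);
# the second quantity should approach 1 / gamma_star.
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print("case (ii)", n, a4_diagnostics(n, gamma_star=2.0, alpha=1.0))
```

The check is only a finite-$n$ illustration of the limits in A4, not a proof.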

Theorem 3.6 shows that the rate of convergence depends upon the step-size sequence $\{\gamma_n, n \geq 1\}$: when $\gamma_n = \gamma_*/n^\alpha$ for $\alpha \in (1/2, 1]$, the maximal rate of convergence is reached with $\alpha = 1$ and the rate is $O(n^{-1/2})$. When $\gamma_n = \gamma_*/n$, one could be interested in optimizing the variance-covariance matrix: introducing a gain matrix $\Gamma$ in the algorithm ([8]) yields the update

$$\theta_{n+1} = \theta_n + \gamma_{n+1}\, \Gamma\, H(X_{n+1}, \theta_n) .$$

It is proved in [4, Proposition 4 p. 112] that for a large family of gain matrices (so-called "admissible gains") a Central Limit Theorem still holds for the sequence of random variables $\{\sqrt{n}(\theta_n - \theta_*), n \geq 0\}$; the minimal variance-covariance matrix is equal to $d^2 U_*$ and is reached with $\Gamma = d\, \gamma_*^{-1}\, \mathrm{Id}$.
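The value $d^2 U_*$ can be recovered by a short computation in the scalar-gain case. The sketch below assumes the case (ii) variance factor $\sigma^2 = \gamma_* d/(2\gamma_* - d)$ for $\gamma_n^{-1/2}(\theta_n - \theta_*)$ with $\gamma_n = \gamma_*/n$, so that $\sqrt{n}(\theta_n - \theta_*)$ has asymptotic covariance $\gamma_* \sigma^2 U_*$. The function name and the grid search are illustrative choices, not taken from the paper.

```python
def sqrt_n_variance_factor(gamma_star, d):
    """Scalar c such that sqrt(n) * (theta_n - theta_*) has covariance c * U_*.

    Assumes gamma_n = gamma_star / n and sigma^2 = gamma_star * d / (2 * gamma_star - d).
    """
    if gamma_star <= d / 2:
        raise ValueError("case (ii) requires gamma_star > d/2")
    sigma2 = gamma_star * d / (2 * gamma_star - d)
    return gamma_star * sigma2


d = 3
# Crude grid search over admissible values gamma_star > d/2.
grid = [d / 2 + 0.01 * k for k in range(1, 2000)]
best = min(grid, key=lambda g: sqrt_n_variance_factor(g, d))
best_value = sqrt_n_variance_factor(best, d)
print(best, best_value)  # minimizer close to d, minimum close to d**2
```

Under these assumptions the minimum is attained near $\gamma_* = d$, which corresponds to an effective step size $d/n$ and is consistent with the gain $\Gamma = d\,\gamma_*^{-1}\,\mathrm{Id}$.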

From a practical point of view, it is known that stochastic approximation algorithms are more efficient when the step-size sequence decreases at a slow rate: in the polynomial schedule, this means that $\gamma_n = \gamma_*/n^\alpha$ with $\alpha$ close to 1/2. As shown by Theorem 3.6, this yields a slower rate of convergence. Nevertheless, combining the Wang-Landau update with an averaging technique makes it possible to reach the optimal rate of convergence and the optimal variance-covariance matrix: by applying [10, Theorem 1.4], it can be proved that $\left\{ \sqrt{n} \left( n^{-1} \sum_{k=1}^n \theta_k - \theta_* \right), n \geq 1 \right\}$ converges in distribution to a centered Gaussian distribution with variance-covariance matrix $d^2 U_*$. The proof of this claim is along the same lines as the proof of Theorem 3.6 and details are therefore omitted.

4 Efficiency of the Wang-Landau algorithm 

We present in this section results on the improved convergence properties of the Wang-Landau 
algorithm (when compared to non-adaptive samplers), by analyzing theoretically and numeri- 
cally the first exit times out of a metastable state. Indeed, adaptive biasing techniques such as 
the Wang-Landau algorithm have been especially designed to be able to switch as fast as pos- 
sible from a metastable state to another in order to efficiently explore the whole configuration 
space. 

We show in this section that the Wang-Landau algorithm makes it possible to escape rapidly from a metastable state, namely from a large probability stratum surrounded by small probability strata. First, we consider in Section 4.1 a toy model composed of only three strata: two large probability strata (the metastable states) separated by a low probability stratum (the transition state). We are able to precisely quantify the time the system needs to go from the first metastable state to the second one, for adaptive and non-adaptive dynamics. We show in particular that the exit time is dramatically reduced with Wang-Landau dynamics compared to the associated non-adaptive dynamics. We then turn to a less simple example in Section 4.2, where we present numerical results consistent with the theoretical behavior obtained with the toy model.



4.1 Analytical results in a simple case 

We consider in this section a very simple toy model (if not the simplest one possible): only three strata, each stratum being composed of a single state. Thus, we have $\mathsf{X} = \{1, 2, 3\}$ and $\mathsf{X}_i = \{i\}$ for $i = 1, 2, 3$. Jumps are only allowed between neighboring states, namely from 1 to $\{1, 2\}$, from 2 to $\{1, 2, 3\}$ and from 3 to $\{2, 3\}$. Though very simple, we believe that this toy model is prototypical of a metastable dynamics. We will check numerically in the next section that our conclusions on this simple test case are indeed also valid for more complicated (and more realistic) situations.



4.1.1 Definition of the dynamics 



We assume that the first and third strata are visited with high probability, and that the second stratum is visited with low probability. More precisely, we set

$$\theta_*(2) = \frac{\varepsilon}{2 + \varepsilon} , \qquad \theta_*(1) = \theta_*(3) = \frac{1}{2 + \varepsilon} \qquad (23)$$

for a small positive parameter $\varepsilon \in (0, 1)$, and consider the limit $\varepsilon \to 0$. The target density $\pi$ on $\mathsf{X}$ is thus defined as: $\pi(\{i\}) = \theta_*(i)$ for $i = 1, 2, 3$ (the reference measure $\lambda$ being the uniform measure on $\mathsf{X} = \{1, 2, 3\}$). The parameters depend on $\varepsilon$, even though we do not explicitly indicate this dependence to keep the notation simple. We also introduce the biased probability measure

$$\pi_\theta(i) = \frac{\theta_*(i)/\theta(i)}{\sum_{j=1}^3 \theta_*(j)/\theta(j)} .$$

Notice that $\pi_{\theta_*} = (1/3, 1/3, 1/3)$ is the uniform measure on $\mathsf{X}$.

The basic building block for the reference non-adaptive Markov chain $\{X_n, n \geq 0\}$ is a symmetric proposal kernel allowing transitions to nearest-neighbor strata only:

$$Q = \begin{pmatrix} 2/3 & 1/3 & 0 \\ 1/3 & 1/3 & 1/3 \\ 0 & 1/3 & 2/3 \end{pmatrix} .$$

The corresponding non-adaptive Markov chain is built using a Metropolis-Hastings algorithm, with $Q$ as the proposal kernel, and $\pi$ as the target distribution. Since $\varepsilon < 1$, its kernel is given by

$$P = \begin{pmatrix} 1 - \varepsilon/3 & \varepsilon/3 & 0 \\ 1/3 & 1/3 & 1/3 \\ 0 & \varepsilon/3 & 1 - \varepsilon/3 \end{pmatrix} .$$

The non-adaptive dynamics $\{X_n, n \geq 0\}$ is metastable, in the sense that the time to go from stratum 1 to 3

$$T_{1 \to 3} = \min \left\{ n : X_n = 3 \text{ starting from } X_0 = 1 \right\}$$

is very large, and more precisely of order $6/\varepsilon$ (see Lemma 4.1 below). This is due to the fact that, in order to go from 1 to 3, the chain has to visit the very low probability transition state 2. This is a prototypical metastable dynamics reminiscent of what happens along molecular dynamics trajectories: due to the very high-dimensional configuration space, only local moves are allowed (otherwise they would be mostly rejected) and thus, it is difficult to go from a very likely region to another one since they are usually separated by low probability zones.

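The associated adaptive Wang-Landau dynamics is defined by the couple $(X_n, \tilde{\theta}_n)$, where $\tilde{\theta}_n = (\tilde{\theta}_n(1), \tilde{\theta}_n(2), \tilde{\theta}_n(3))$ is a vector of three nonnegative real numbers. The normalized parameters $\theta_n$ associated to $\tilde{\theta}_n$ are

$$\theta_n(i) = \frac{\tilde{\theta}_n(i)}{\sum_{j=1}^3 \tilde{\theta}_n(j)} .$$

The updating rules are as follows: for all $n \geq 0$, given $(X_n, \tilde{\theta}_n)$,

$X_{n+1}$ is built using one step of a Metropolis-Hastings procedure, with $Q$ as a proposal kernel, and $\pi_{\theta_n}$ as a target distribution,

$$\tilde{\theta}_{n+1}(i) = \tilde{\theta}_n(i) \left( 1 + \gamma_{n+1} \mathbf{1}_{X_{n+1} = i} \right) , \qquad (24)$$

where

$$\gamma_n = \gamma_*\, n^{-\alpha} , \qquad (25)$$

for a positive constant $\gamma_*$, and a parameter $\alpha \in [1/2, 1]$. Note that the updating rule in (24) corresponds to the standard nonlinear update (10). The transition kernel to go from $X_n$ to $X_{n+1}$ is given by $P_{\theta_n}$ where, for any $\theta$,

$$P_\theta = \begin{pmatrix}
1 - \frac{1}{3}\left( \frac{\varepsilon\theta(1)}{\theta(2)} \wedge 1 \right) & \frac{1}{3}\left( \frac{\varepsilon\theta(1)}{\theta(2)} \wedge 1 \right) & 0 \\
\frac{1}{3}\left( \frac{\theta(2)}{\varepsilon\theta(1)} \wedge 1 \right) & 1 - \frac{1}{3}\left( \frac{\theta(2)}{\varepsilon\theta(1)} \wedge 1 \right) - \frac{1}{3}\left( \frac{\theta(2)}{\varepsilon\theta(3)} \wedge 1 \right) & \frac{1}{3}\left( \frac{\theta(2)}{\varepsilon\theta(3)} \wedge 1 \right) \\
0 & \frac{1}{3}\left( \frac{\varepsilon\theta(3)}{\theta(2)} \wedge 1 \right) & 1 - \frac{1}{3}\left( \frac{\varepsilon\theta(3)}{\theta(2)} \wedge 1 \right)
\end{pmatrix} .$$

We start from initially equiprobable strata $\tilde{\theta}_0(1) = \tilde{\theta}_0(2) = \tilde{\theta}_0(3) = 1$, so that $\pi_{\theta_0} = \pi$. Notice that the non-adaptive dynamics is simply the Markov chain with transition kernel $P_{(1,1,1)}$. It can be obtained from the adaptive dynamics by setting $\gamma_* = 0$, in which case $\tilde{\theta}_n = \tilde{\theta}_0 = (1, 1, 1)$ for all $n \geq 0$. As above for the non-adaptive dynamics, we define the time to go from stratum 1 to stratum 3 for the Wang-Landau dynamics as

$$\mathcal{T}_{1 \to 3} = \min \left\{ n : X_n = 3 \text{ starting from } X_0 = 1 \right\} ,$$

where $(X_n)$ now evolves according to the Wang-Landau dynamics.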

The aim of this section is to show that, in some sense to be made precise, the non-adaptive exit time $T_{1 \to 3}$ is much larger than the Wang-Landau exit time $\mathcal{T}_{1 \to 3}$, i.e. the Wang-Landau dynamics is much less metastable than the corresponding non-adaptive dynamics. This is related to the fact that, when the stochastic process $\{X_n, n \geq 0\}$ remains stuck in stratum 1, this stratum gets more and more penalized ($\theta_n(1)$ increases, see (24)), so that a transition to stratum 2 becomes more and more favorable. From stratum 2, a jump to stratum 3 is then very likely. This is the bottom line of the whole adaptive procedure: penalizing the already visited strata in order to explore new regions very quickly.
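The mechanism described above can be illustrated by a direct simulation of the three-state model. This is a minimal sketch under stated assumptions: the proposal moves to each neighbor with probability 1/3, the target weights are proportional to $(1, \varepsilon, 1)$, and the weight of the visited stratum is multiplied by $1 + \gamma_* n^{-\alpha}$ at step $n$. The function name `exit_time_1_to_3` and the chosen parameter values are illustrative; logarithms of the weights are stored only to avoid overflow.

```python
import math
import random


def exit_time_1_to_3(eps, gamma_star, alpha, rng, max_steps=10 ** 7):
    """First n with X_n = 3, starting from X_0 = 1 (gamma_star = 0: non-adaptive chain)."""
    log_target = {1: 0.0, 2: math.log(eps), 3: 0.0}  # log theta_* up to a constant
    log_theta = {1: 0.0, 2: 0.0, 3: 0.0}             # logs of the unnormalized weights
    x = 1
    for n in range(1, max_steps + 1):
        # Nearest-neighbor proposal: left, right or stay, each with probability 1/3.
        u = rng.random()
        y = x - 1 if u < 1 / 3 else (x + 1 if u < 2 / 3 else x)
        if y in log_target and y != x:
            # Metropolis ratio for the biased target pi_theta(i) ~ theta_*(i) / theta(i).
            log_ratio = (log_target[y] - log_theta[y]) - (log_target[x] - log_theta[x])
            if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                x = y
        if x == 3:
            return n
        # Penalize the stratum that has just been visited.
        log_theta[x] += math.log1p(gamma_star * n ** (-alpha))
    raise RuntimeError("no exit within max_steps")


rng = random.Random(0)
eps, runs = 0.01, 200
mean_times = {}
for gamma_star in (0.0, 1.0):
    total = sum(exit_time_1_to_3(eps, gamma_star, 0.75, rng) for _ in range(runs))
    mean_times[gamma_star] = total / runs
print(mean_times)
```

With these (arbitrary) parameters, the empirical mean exit time of the adaptive chain is expected to be far below that of the non-adaptive chain, in line with the discussion above.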



4.1.2 Precise statement on the exit times 

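We now provide a precise statement on how the exit times $T_{1 \to 3}$ (non-adaptive dynamics) and $\mathcal{T}_{1 \to 3}$ (Wang-Landau dynamics) scale when $\varepsilon$ goes to zero. For the non-adaptive dynamics, it holds (see Section 5.7.1 for the proof):

Lemma 4.1. The time $T_{1 \to 3}$ scales like $6/\varepsilon$, in the following sense:

$$\frac{\varepsilon}{6}\, \mathbb{E}(T_{1 \to 3}) = 1 + \frac{\varepsilon}{2} \sim 1 , \qquad (26)$$

$$\forall c > 0, \quad \lim_{\varepsilon \to 0} \mathbb{P}\left( \frac{\varepsilon}{6}\, T_{1 \to 3} > c \right) = e^{-c} . \qquad (27)$$

Eq. (27) states that when $\varepsilon \to 0$, $\varepsilon T_{1 \to 3}$ converges in distribution to an exponential random variable with parameter 1/6.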
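The expectation in (26) can be checked exactly by solving the first-step equations of the non-adaptive chain. The sketch below assumes the kernel with rows $(1 - \varepsilon/3, \varepsilon/3, 0)$, $(1/3, 1/3, 1/3)$ and $(0, \varepsilon/3, 1 - \varepsilon/3)$, i.e. a Metropolis step with a nearest-neighbor proposal of probability 1/3 per neighbor. The function name is illustrative.

```python
from fractions import Fraction


def expected_exit_time(eps):
    """Exact E(T_{1->3}) for the non-adaptive three-state chain (eps given as a Fraction)."""
    eps = Fraction(eps)
    third = Fraction(1, 3)
    # Transition probabilities among the transient states 1 and 2.
    p11, p12 = 1 - eps * third, eps * third
    p21, p22 = third, third
    # Expected hitting times t = (t1, t2) solve (I - P) t = (1, 1); use Cramer's rule.
    a11, a12, a21, a22 = 1 - p11, -p12, -p21, 1 - p22
    det = a11 * a22 - a12 * a21
    return (a22 - a12) / det


eps = Fraction(1, 100)
mean_time = expected_exit_time(eps)
print(mean_time, eps / 6 * mean_time)
```

Under these assumptions the solution is $\mathbb{E}(T_{1 \to 3}) = 3 + 6/\varepsilon$, so that $(\varepsilon/6)\,\mathbb{E}(T_{1 \to 3}) = 1 + \varepsilon/2$.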

Let us now consider the Wang-Landau dynamics (24). The following result holds (see Section 5.7.2 for the proof).

Proposition 4.2. Let $\gamma_*$ and $\alpha$ be the two constants defining the sequence $\gamma_n$, as given by (25). Let us assume that $\alpha \in [1/2, 1]$, with $\gamma_* < 1$ if $\alpha = 1/2$. Then, in the case $\alpha \in [1/2, 1)$, for any two positive constants $C_a, C_b$ such that

$$C_a < \left( \frac{1 - \alpha}{\gamma_*} \right)^{1/(1-\alpha)} < C_b ,$$

it holds

$$\lim_{\varepsilon \to 0} \mathbb{P}\left( |\ln \varepsilon|^{-1/(1-\alpha)}\, \mathcal{T}_{1 \to 3} \in (C_a, C_b) \right) = 1 . \qquad (28)$$

When $\alpha = 1$, for any function $h$ such that $\lim_{\varepsilon \to 0} h(\varepsilon) = +\infty$,

$$\lim_{\varepsilon \to 0} \mathbb{P}\left( h(\varepsilon)\, \varepsilon^{1/(1+\gamma_*)}\, \mathcal{T}_{1 \to 3} > 1 \right) = \lim_{\varepsilon \to 0} \mathbb{P}\left( \frac{1}{h(\varepsilon)}\, \varepsilon^{1/(1+\gamma_*)}\, \mathcal{T}_{1 \to 3} < 1 \right) = 1 . \qquad (29)$$

In fact, a more accurate estimate for the lower bound in the case $\alpha \in [1/2, 1)$ is provided below, in Proposition 5.11. In the second result (29), one should think of functions $h$ going very slowly to infinity, so that this result essentially shows that when $\alpha = 1$, $\mathcal{T}_{1 \to 3}$ scales like $\varepsilon^{-1/(1+\gamma_*)}$. Notice that these results on first exit times also hold for $\alpha = 1/2$, which is an excluded value to obtain convergence of the Wang-Landau algorithm analyzed in Section 3 (see assumptions A3 and A4).

Thus, roughly speaking, $\mathcal{T}_{1 \to 3}$ scales like $|\ln \varepsilon|^{1/(1-\alpha)}$ in the case $\alpha \in [1/2, 1)$ and like $\varepsilon^{-1/(1+\gamma_*)}$ in the case $\alpha = 1$. In any case, the Wang-Landau algorithm is such that $\mathcal{T}_{1 \to 3}$ is much smaller than $T_{1 \to 3}$ in the limit $\varepsilon \to 0$ (namely in metastable situations).
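To get a feeling for the orders of magnitude, the leading-order scalings can be tabulated for a few values of $\varepsilon$. This is a rough sketch: it assumes the scale $((1-\alpha)|\ln \varepsilon|/\gamma_*)^{1/(1-\alpha)}$ for $\alpha \in [1/2, 1)$, the scale $\varepsilon^{-1/(1+\gamma_*)}$ for $\alpha = 1$ (prefactors ignored), and $6/\varepsilon$ for the non-adaptive chain. Function names and parameter values are illustrative.

```python
import math


def wang_landau_scale(eps, gamma_star, alpha):
    """Leading-order scale of the Wang-Landau exit time (prefactors ignored)."""
    if alpha == 1.0:
        return eps ** (-1.0 / (1.0 + gamma_star))
    if 0.5 <= alpha < 1.0:
        return ((1.0 - alpha) * abs(math.log(eps)) / gamma_star) ** (1.0 / (1.0 - alpha))
    raise ValueError("alpha must lie in [1/2, 1]")


def non_adaptive_scale(eps):
    """Leading-order mean exit time of the non-adaptive chain."""
    return 6.0 / eps


for eps in (1e-2, 1e-4, 1e-8):
    print(eps,
          non_adaptive_scale(eps),
          wang_landau_scale(eps, gamma_star=0.5, alpha=0.5),
          wang_landau_scale(eps, gamma_star=1.0, alpha=1.0))
```

The gap between the columns widens as $\varepsilon$ decreases: polylogarithmic growth for $\alpha < 1$, a reduced power of $1/\varepsilon$ for $\alpha = 1$, against $1/\varepsilon$ without adaptation.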
4.2 Numerical illustrations

The aim of this section is to show that (most of) the results obtained in the very simple three-state model of Section 4.1 are still valid for a less simple example inspired by target measures used in computational statistical physics. In these numerical experiments, we also investigate the behavior of the algorithm for values of $\alpha$ in the interval $(0, 1/2]$, which are excluded values to obtain convergence of stochastic approximation procedures in general, and in particular of the Wang-Landau algorithm analyzed in Section 3 (see assumptions A3 and A4).

Our aim is to study the behavior of the exit times out of a metastable state as the temperature in the system goes to zero. The temperature will thus play a role similar to the role of $\varepsilon$ in Section 4.1 (see formula (31) below, where $\beta$ is the inverse temperature).

We consider to this end the system based on a two-dimensional potential suggested in [28]. The state space is $\mathsf{X} = [-R, R] \times \mathbb{R}$ (with $R > 0$), and we denote by $x = (x_1, x_2)$ a generic element of $\mathsf{X}$. The reference measure $\lambda$ is the Lebesgue measure. The density of the invariant measure reads

$$\pi(x) = Z_\beta^{-1} \exp(-\beta U(x))$$

for some positive inverse temperature $\beta$ and normalization constant $Z_\beta$, with

$$\begin{aligned} U(x_1, x_2) = {} & 3 \exp\left( -x_1^2 - \left( x_2 - \tfrac{1}{3} \right)^2 \right) - 3 \exp\left( -x_1^2 - \left( x_2 - \tfrac{5}{3} \right)^2 \right) \\ & - 5 \exp\left( -(x_1 - 1)^2 - x_2^2 \right) - 5 \exp\left( -(x_1 + 1)^2 - x_2^2 \right) + 0.2\, x_1^4 + 0.2 \left( x_2 - \tfrac{1}{3} \right)^4 . \end{aligned} \qquad (30)$$

A plot of the level sets of the potential $U$ is presented in Figure 1. We introduce $d$ strata $\mathsf{X}_\ell = (a_\ell, a_{\ell+1}) \times \mathbb{R}$, with $a_\ell = -R + 2\ell R/d$ and $\ell = 0, \ldots, d - 1$. From Laplace's method, the ratio between the weight of the stratum in the transition region around $x_1 = 0$ and the strata located near the global minima of the potential $U$ (namely $x_- = (-1, 0)$ and $x_+ = (1, 0)$) scales like $\exp(-\beta \mu_0)$ for some positive $\mu_0$, in the limit $\beta \to \infty$. In view of (23), we thus expect that the equivalent of the parameter $\varepsilon$ of Section 4.1 in terms of $\beta$ should be

$$\varepsilon(\beta) = C \exp(-\beta \mu_0) . \qquad (31)$$

The aim of this section is to check numerically that, with this relation between $\beta$ and $\varepsilon$, the scaling behaviors we obtained in the previous section on exit times for the very simple toy model with three states are indeed also observed for a Markovian dynamics with local moves on the two-dimensional potential $U$. Let us now make precise the dynamics we consider.
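The potential and the strata can be written down concretely as follows. This is a sketch under assumptions: the shifts 1/3 and 5/3 in the exponentials and in the quartic term are taken from the usual form of this three-well potential, since the formula is damaged in the source, and the helper names `potential` and `stratum_index` are hypothetical.

```python
import math


def potential(x1, x2):
    """Two-dimensional potential U(x1, x2); the shifts 1/3 and 5/3 are assumed values."""
    return (3.0 * math.exp(-x1 ** 2 - (x2 - 1 / 3) ** 2)
            - 3.0 * math.exp(-x1 ** 2 - (x2 - 5 / 3) ** 2)
            - 5.0 * math.exp(-(x1 - 1) ** 2 - x2 ** 2)
            - 5.0 * math.exp(-(x1 + 1) ** 2 - x2 ** 2)
            + 0.2 * x1 ** 4 + 0.2 * (x2 - 1 / 3) ** 4)


def stratum_index(x1, R, d):
    """Index l in {0, ..., d-1} of the stratum (a_l, a_{l+1}) x R containing x1,
    with a_l = -R + 2*l*R/d; values on the boundary are clamped to a valid index."""
    l = int((x1 + R) * d / (2 * R))
    return min(max(l, 0), d - 1)


# The two wells near (-1, 0) and (1, 0) are much deeper than the region around x1 = 0.
print(potential(-1.0, 0.0), potential(0.0, 0.0), potential(1.0, 0.0))
print(stratum_index(-1.0, R=1.2, d=24), stratum_index(0.05, R=1.2, d=24))
```

The symmetry $U(x_1, x_2) = U(-x_1, x_2)$ is visible directly in the formula, and the energy gap between the wells and the central region is what makes the dynamics metastable at large $\beta$.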

The reference (non-adaptive) Markov chain is obtained by a Metropolis algorithm, using an isotropic Gaussian proposal with variance-covariance matrix $v^2\, \mathrm{Id}$ where $\mathrm{Id}$ is the $2 \times 2$ identity matrix. This dynamics is metastable: it takes a lot of time to go from the left to the right, or from the right to the left (notice that the potential is symmetric with respect to the y-axis). More precisely, there are two main metastable states: one located around $x_- = (-1, 0)$, and another one around $x_+ = (1, 0)$. These two states are separated by a region of low probability. The metastability of the dynamics increases with $\beta$ (i.e. as the temperature decreases). The larger $\beta$ is, the larger is the ratio between the weight of the strata located near the main metastable states and the weight of the transition region around $x_1 = 0$, and the more difficult it is to leave the left metastable state to enter the one on the right (and conversely). We compare the reference (non-adaptive) Markov chain to the associated Wang-Landau dynamics. In particular, the same proposal function in the Metropolis algorithm is used for the Wang-Landau dynamics as for the reference dynamics. As in the previous section, the nonlinear update (10) is used. The step-size sequence is chosen as in (25). The initial weight vector $\theta_0$ is $(1/d, \ldots, 1/d)$. Notice that the reference dynamics corresponds to the case when $\gamma_* = 0$ (no adaptation).

Average exit times are obtained by performing independent realizations of the following procedure: initialize the system in the state $X_0 = (-1, 0)$, and run the dynamics until the first time index $N$ such that the first coordinate of $X_N$ is larger than 1. For a given value of the inverse temperature $\beta$, the average exit time is obtained by averaging $N$ over $M$ independent realizations. This average exit time is denoted $\tilde{t}_\beta$ for the Wang-Landau dynamics, and $t_\beta$ for the reference dynamics. We use the Mersenne Twister random number generator as implemented in the GSL library. For the numerical results presented here, we take $R = 1.2$, $d = 24$ and $v = 0.1$. The magnitude of the random displacements (which are of order $v$) is chosen in order to be comparable to the width of one stratum $2R/d = 0.1$, so that from one stratum, the neighboring ones are the most likely to be visited. This is reminiscent of the dynamics used on the toy model in the previous section. We choose $M$ such that the relative error on the average exit time is less than a few percent in the worst cases. For computational reasons, $M$ is of the order of a few hundred for the largest exit times, while $M = 10^5$ in the easiest cases.
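One realization of the procedure above can be sketched as follows. This is an illustrative implementation, not the authors' code (which relies on the GSL library): it uses Python's `random` module, assumes the shifts 1/3 and 5/3 in the potential, rejects proposals leaving $[-R, R] \times \mathbb{R}$, and stores logarithms of the unnormalized weights, which is sufficient because only weight ratios enter the Metropolis ratio. All function names are hypothetical.

```python
import math
import random


def potential(x1, x2):
    """Two-dimensional potential; the shifts 1/3 and 5/3 are assumed values."""
    return (3.0 * math.exp(-x1 ** 2 - (x2 - 1 / 3) ** 2)
            - 3.0 * math.exp(-x1 ** 2 - (x2 - 5 / 3) ** 2)
            - 5.0 * math.exp(-(x1 - 1) ** 2 - x2 ** 2)
            - 5.0 * math.exp(-(x1 + 1) ** 2 - x2 ** 2)
            + 0.2 * x1 ** 4 + 0.2 * (x2 - 1 / 3) ** 4)


def exit_time(beta, gamma_star, alpha, rng, R=1.2, d=24, v=0.1, max_steps=10 ** 7):
    """First index N at which the first coordinate of X_N exceeds 1, from X_0 = (-1, 0).

    gamma_star = 0 gives the reference (non-adaptive) Metropolis chain.
    """
    def stratum(x1):
        return min(d - 1, max(0, int((x1 + R) * d / (2 * R))))

    log_theta = [0.0] * d  # equal initial weights; only ratios of weights matter
    x1, x2 = -1.0, 0.0
    u_x, i_x = potential(x1, x2), stratum(x1)
    for n in range(1, max_steps + 1):
        # Isotropic Gaussian proposal with covariance v^2 Id.
        y1 = x1 + v * rng.gauss(0.0, 1.0)
        y2 = x2 + v * rng.gauss(0.0, 1.0)
        if -R <= y1 <= R:  # proposals outside the state space are rejected
            u_y, i_y = potential(y1, y2), stratum(y1)
            # log of pi_theta(y) / pi_theta(x), with pi_theta ~ exp(-beta U) / theta(stratum).
            log_ratio = -beta * (u_y - u_x) - (log_theta[i_y] - log_theta[i_x])
            if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                x1, x2, u_x, i_x = y1, y2, u_y, i_y
        if x1 > 1.0:
            return n
        # Multiply the weight of the current stratum by (1 + gamma_n), gamma_n = gamma_star / n^alpha.
        log_theta[i_x] += math.log1p(gamma_star * n ** (-alpha))
    raise RuntimeError("no exit within max_steps")


rng = random.Random(0)
# A small inverse temperature keeps this single illustrative run short.
print(exit_time(beta=1.0, gamma_star=0.0, alpha=1.0, rng=rng),
      exit_time(beta=1.0, gamma_star=1.0, alpha=0.5, rng=rng))
```

Averaging `exit_time` over $M$ independent runs for increasing $\beta$ gives estimates of the average exit times; a single run, as printed here, is of course very noisy.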

Before giving the numerical results, let us state the expected scaling behaviors for $t_\beta$ (reference dynamics) and $\tilde{t}_\beta$ (Wang-Landau dynamics) in the limit $\beta \to \infty$, in view of Lemma 4.1, Proposition 4.2 and (31). First, the scaling (26) implies that for the reference dynamics, under the relation (31) (in the limit $\beta \to \infty$),

$$t_\beta \sim C \exp(\beta \mu_0) . \qquad (32)$$

Second, for the Wang-Landau dynamics, the scaling results (28)-(29) imply that, under the relation (31) (in the limit $\beta \to \infty$): for $\alpha \in [1/2, 1)$ (and we will even consider $\alpha \in (0, 1)$ below),

$$\tilde{t}_\beta \sim \left( \frac{(1 - \alpha)\, \mu_0\, \beta}{\gamma_*} \right)^{1/(1-\alpha)} , \qquad (33)$$

while, for $\alpha = 1$,

$$\tilde{t}_\beta \sim C_{\gamma_*} \exp\left( \frac{\beta \mu_0}{1 + \gamma_*} \right) . \qquad (34)$$

In practice, the range of values of $\beta$ required to observe the asymptotic regime $\beta \to \infty$ depends on the values of $\alpha$ and $\gamma_*$ (see Figure 2).

Let us first check that we indeed recover the correct scaling behavior (32) on the average exit times for the reference (non-adaptive) dynamics. In Figure 2(a), we plot, as a function of $\beta$, the average exit time $t_\beta$ for the non-adaptive dynamics, using a logarithmic scale on the y-axis. The affine fit is very good, and yields an approximate value for the slope: $\mu_0 \simeq 2.32$. We then plot the Wang-Landau average exit time $\tilde{t}_\beta$ as a function of $\beta$ in the case $\alpha = 1$ and $\gamma_* = 2$ in Figure 2(b), still using a logarithmic scale on the y-axis. As expected from (34), we indeed observe some exponential asymptotic behavior $\tilde{t}_\beta \sim C_{\gamma_*} \exp(\beta \mu_{\gamma_*})$. This is true for other values of $\gamma_*$. We report the corresponding slopes $\mu_{\gamma_*}$ for various values of $\gamma_*$ in Table 1. Although the exponential dependence of $\tilde{t}_\beta$ on $\beta$ consistent with (34) is reproduced, the exact dependence on $\gamma_*$ of the constant in the exponential predicted by the analytical example is not exactly observed here since $\mu_{\gamma_*}/\mu_0 \neq 1/(1 + \gamma_*)$. In fact, $\mu_{\gamma_*}/\mu_0$ is systematically larger than $1/(1 + \gamma_*)$.

We now turn to the case $\alpha \in (0, 1)$ where we expect $\tilde{t}_\beta \sim C_\alpha \beta^{1/(1-\alpha)}$, see (33). Note that we also consider the case $\alpha \in (0, 1/2)$ which was not covered by the theoretical analysis of Section 4.1. To confirm the expected behavior, we plot $\tilde{t}_\beta$ as a function of $\beta$ in a log-log scale,

    $\gamma_*$    $\mu_{\gamma_*}$    $\mu_{\gamma_*}/\mu_0$    $1/(1+\gamma_*)$
    0             2.32                1                         1
    1             1.74                0.75                      0.5
    2             1.51                0.65                      0.33
    4             1.25                0.54                      0.20
    8             0.92                0.40                      0.11

Table 1: Update with step-sizes $\gamma_n = \gamma_*/n$ ($\alpha = 1$). Exponents of the law $\tilde{t}_\beta \sim C_{\gamma_*} \exp(\mu_{\gamma_*} \beta)$ for various values of $\gamma_*$.



    $\alpha$    $\mu_\alpha$    $1/(1-\alpha)$
    0.125       1.11            1.14
    0.25        1.30            1.33
    0.375       1.55            1.60
    0.5         2.02            2.00
    0.625       2.72            2.67
    0.75        4.06            4.00

Table 2: Update with step-sizes $\gamma_n = n^{-\alpha}$. Exponents of the scaling law $\tilde{t}_\beta \sim C_\alpha \beta^{\mu_\alpha}$ for $\alpha \in (0, 1)$.



see Figures 2(c) and 2(d) for the cases $\alpha = 0.75$ and $\alpha = 0.125$ respectively. We observe in all cases a dependence $\tilde{t}_\beta \sim C_\alpha \beta^{\mu_\alpha}$, the value of the exponent $\mu_\alpha$ being the slope of the affine fit in the log-log diagram. The estimated exponents are gathered in Table 2 for various values of $\alpha$ when $\gamma_* = 1$. They compare very well with the value $1/(1 - \alpha)$ predicted from (33). On the other hand, we were not able to obtain a meaningful dependence of the prefactor $C_\alpha$ on the parameter $\alpha$.
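The comparisons drawn from Tables 1 and 2 amount to simple arithmetic, reproduced in the short sketch below. The numerical values are those reported in the tables; the variable names are illustrative.

```python
# Table 1: fitted slopes mu_{gamma_*} for alpha = 1, with mu_0 = 2.32 for the reference dynamics.
mu_0 = 2.32
table1 = {1: 1.74, 2: 1.51, 4: 1.25, 8: 0.92}
ratios = {}
for gamma_star, mu in table1.items():
    ratios[gamma_star] = mu / mu_0
    # Observed ratio against the toy-model prediction 1 / (1 + gamma_*).
    print(gamma_star, round(ratios[gamma_star], 2), round(1 / (1 + gamma_star), 2))

# Table 2: fitted exponents mu_alpha for gamma_* = 1, against the prediction 1 / (1 - alpha).
table2 = {0.125: 1.11, 0.25: 1.30, 0.375: 1.55, 0.5: 2.02, 0.625: 2.72, 0.75: 4.06}
relative_errors = {}
for alpha, mu in table2.items():
    predicted = 1 / (1 - alpha)
    relative_errors[alpha] = abs(mu - predicted) / predicted
    print(alpha, mu, round(predicted, 2), round(relative_errors[alpha], 3))
```

The ratios of Table 1 all exceed $1/(1 + \gamma_*)$, while the exponents of Table 2 stay within a few percent of $1/(1 - \alpha)$, which is the content of the two observations made in the text.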



5 Proofs 

In the following, we denote by $\lfloor x \rfloor$ the integer part of $x \in \mathbb{R}$, namely the integer such that $\lfloor x \rfloor \leq x < \lfloor x \rfloor + 1$. We will also use the notation $\lceil x \rceil$ for the integer such that $\lceil x \rceil - 1 < x \leq \lceil x \rceil$.

5.1 Proof of Proposition 3.1

We prove (13); the second assertion follows by [26, Theorem 16.2.4]. Since $q$ is symmetric, it holds by definition of the Metropolis kernel that

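$$P_\theta(x, A) \geq \int_A q(x, y) \left( 1 \wedge \frac{\pi_\theta(y)}{\pi_\theta(x)} \right) \lambda(dy) \geq \frac{\inf_{\mathsf{X}^2} q}{\sup_{\mathsf{X}} \pi_\theta} \int_A \pi_\theta(y)\, \lambda(dy) .$$

Under A2, $\inf_{\mathsf{X}^2} q > 0$. Furthermore, since $\theta(i) > 0$ and $\theta_*(i) > 0$ for any $i \in \{1, \ldots, d\}$,

$$\sup_{\mathsf{X}} \pi_\theta \leq \left( \sum_{k=1}^d \frac{\theta_*(k)}{\theta(k)} \right)^{-1} \sup_{i \in \{1, \ldots, d\}} \frac{1}{\theta(i)}\, \sup_{\mathsf{X}} \pi \leq \sup_{i \in \{1, \ldots, d\}} \frac{\theta(i)}{\theta_*(i)}\, \frac{1}{\theta(i)}\, \sup_{\mathsf{X}} \pi = \frac{\sup_{\mathsf{X}} \pi}{\min_{1 \leq i \leq d} \theta_*(i)} .$$

The right-hand side is finite by A1 and does not depend upon $\theta$. Therefore, (13) holds with $\rho = \inf_{\mathsf{X}^2} q\, (\sup_{\mathsf{X}} \pi)^{-1} \min_{1 \leq i \leq d} \theta_*(i)$.

[Figure 2, panels: (a) Reference dynamics (logarithmic scale on the y-axis); (b) $\alpha = 1$ and $\gamma_* = 2$ (logarithmic scale on the y-axis); (c) $\alpha = 0.75$ (logarithmic scale on the x- and y-axes); (d) $\alpha = 0.125$ (logarithmic scale on the x- and y-axes).]

Figure 2: Average exit time as a function of $\beta$ for various dynamics.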

5.2 Proof of Theorem 3.2

Throughout this proof, it is assumed for simplicity that the sequence $\{\gamma_n, n \geq 1\}$ is non-increasing. The general case (i.e. when $\{\gamma_n, n \geq n_0\}$ is non-increasing) is a trivial adaptation of the following lines.

Define the smallest index of stratum with smallest weight according to $\theta_n$, i.e.

$$I_n = \min\{ i : \theta_n(i) = \underline{\theta}_n \} , \qquad (35)$$

where $\underline{\theta}_n$ is given by (15). We also introduce the stopping times $T_k$ as the times of return in the stratum of smallest weight: $T_0 = 0$ and, for $k \geq 1$,

$$T_k = \inf\{ n > T_{k-1} : X_n \in \mathsf{X}_{I_n} \} ,$$

with the convention that $\inf \emptyset = +\infty$. With these notations, Theorem 3.2 is implied by the following proposition, the proof of which is the goal of this section.

Proposition 5.1. Under A1, A2 and A3a, it holds

$$\mathbb{P}\left( \forall k \in \mathbb{N},\ T_k < +\infty \right) = 1 , \qquad (36)$$

$$\mathbb{P}\left( \limsup_{k \to \infty} \underline{\theta}_{T_k - 1} > 0 \right) = 1 , \qquad (37)$$

where $\underline{\theta}_n$ is given by (15).



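When finite, the stopping times $T_k$ are such that $\underline{\theta}_{T_k} - \underline{\theta}_{T_k - 1}$ admits a known increase. Indeed, by the update rule (7),

$$\theta_{T_k - 1}(I_{T_k}) = \frac{\theta_{T_k}(I_{T_k})}{1 + \gamma_{T_k}\left( 1 - \theta_{T_k - 1}(I_{T_k}) \right)} < \theta_{T_k}(I_{T_k}) \leq \min_{j \neq I_{T_k}} \theta_{T_k}(j) < \min_{j \neq I_{T_k}} \frac{\theta_{T_k}(j)}{1 - \gamma_{T_k} \theta_{T_k - 1}(I_{T_k})} = \min_{j \neq I_{T_k}} \theta_{T_k - 1}(j) ,$$

so that

$$I_{T_k - 1} = I_{T_k} \quad \text{and} \quad \underline{\theta}_{T_k} = \underline{\theta}_{T_k - 1} \left( 1 + \gamma_{T_k} \left( 1 - \underline{\theta}_{T_k - 1} \right) \right) . \qquad (38)$$

In the evolution from $\underline{\theta}_{T_k - 1}$ to $\underline{\theta}_{T_{k+1} - 1}$, the increase provided by the return to the stratum $\mathsf{X}_{I_{T_k}}$ at time $T_k$ compensates the decrease of $\underline{\theta}_n$ generated by the subsequent visits to the other strata for $n \in \{T_k + 1, \ldots, T_{k+1} - 1\}$, provided that $T_{k+1} - T_k$ is small enough. This is indeed possible since the decrease arises from multiplicative factors $1 - \gamma_n \theta_{n-1}(I(X_n))$, where $\gamma_n \theta_{n-1}(I(X_n))$ is typically much smaller than the term $\gamma_{T_k}(1 - \underline{\theta}_{T_k - 1})$ appearing in (38).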



5.2.1 Proof of (36)

To prove the first assertion, we proceed by induction on $k$ and suppose that $\mathbb{P}(T_k < +\infty) = 1$. This assertion is true for $k = 0$. To check the condition $\mathbb{P}(T_{k+1} < +\infty) = 1$, we are going to construct a specific sequence ensuring that $X_n$ returns in the stratum of smallest weight at some point (see (39) below), and show that this sequence has a positive probability of occurrence (see Lemma 5.2 below).

For $m \in \mathbb{N}$, let $\theta_{T_k + md}((1)_m) \leq \theta_{T_k + md}((2)_m) \leq \cdots \leq \theta_{T_k + md}((d)_m)$ denote the increasing reordering of $(\theta_{T_k + md}(i))_{1 \leq i \leq d}$ (notice that $\theta_{T_k + md}((1)_m) = \underline{\theta}_{T_k + md}$), and define $i_m = \max\{ i \leq d : \theta_{T_k + md}((i)_m) \leq \underline{\theta}_{T_k + md} (1 + \gamma_1)/(1 - \gamma_1) \}$. The indices $(1)_m, \ldots, (i_m)_m$ are all the indices of the strata with weights close enough to the minimal weight. We then consider the sequence obtained by visiting successively the strata with indices $(i)_m$ for $i \leq i_m$, in decreasing order. This corresponds to the event

$$A_m = \left\{ X_{T_k + md + 1} \in \mathsf{X}_{(i_m)_m},\ X_{T_k + md + 2} \in \mathsf{X}_{(i_m - 1)_m},\ \ldots,\ X_{T_k + md + i_m} \in \mathsf{X}_{(1)_m} \right\} . \qquad (39)$$

On $A_m$, the weights are not updated for $j \geq i_m + 1$, so that

$$\frac{\theta_{T_k + md + i_m - 1}((j)_m)}{\theta_{T_k + md + i_m - 1}((1)_m)} = \frac{\theta_{T_k + md}((j)_m)}{\theta_{T_k + md}((1)_m)} > \frac{1 + \gamma_1}{1 - \gamma_1} \geq \frac{1 + \gamma_{T_k + md + i_m} \left( 1 - \theta_{T_k + md + i_m - 1}((1)_m) \right)}{1 - \gamma_{T_k + md + i_m}\, \theta_{T_k + md + i_m - 1}((1)_m)} ,$$

where we have used successively the definition of $i_m$ and A3a. The inequality between the left-most and right-most terms rewrites $\theta_{T_k + md + i_m}((j)_m) > \theta_{T_k + md + i_m}((1)_m)$. Now, for $j \in \{2, \ldots, i_m\}$, it holds on $A_m$

$$\frac{\theta_{T_k + md + i_m}((j)_m)}{\theta_{T_k + md + i_m}((1)_m)} = \frac{\theta_{T_k + md}((j)_m)}{\theta_{T_k + md}((1)_m)} \times \frac{1 + \dfrac{\gamma_{T_k + md + i_m + 1 - j}}{1 - \gamma_{T_k + md + i_m + 1 - j}\, \theta_{T_k + md + i_m - j}((j)_m)}}{1 + \dfrac{\gamma_{T_k + md + i_m}}{1 - \gamma_{T_k + md + i_m}\, \theta_{T_k + md + i_m - 1}((1)_m)}} .$$

The second factor on the right-hand side is larger than 1 on $A_m$ since $\gamma_{T_k + md + i_m} \leq \gamma_{T_k + md + i_m + 1 - j}$ by A3a and, using the fact that the stratum $\mathsf{X}_{(1)_m}$ is not visited until the last step,

$$\theta_{T_k + md + i_m - 1}((1)_m) \leq \theta_{T_k + md + i_m - j}((1)_m) = \theta_{T_k + md + i_m - j}((j)_m) \times \frac{\theta_{T_k + md}((1)_m)}{\theta_{T_k + md}((j)_m)} \leq \theta_{T_k + md + i_m - j}((j)_m) .$$

Therefore, the stratum with smallest weight at iteration $T_k + md + i_m$ is still $(1)_m$, which means that $I_{T_k + md + i_m} = (1)_m$ on $A_m$ and

$$\text{on } A_m, \quad T_{k+1} \leq T_k + md + i_m \leq T_k + (m + 1)d . \qquad (40)$$

To deduce that $\mathbb{P}(T_{k+1} < +\infty) = 1$, we use the following lemma (whose proof is postponed to Section 5.2.3).

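Lemma 5.2. Under A1 and A2, there exists a constant $p \in (0, 1]$ not depending on $k$ such that almost surely

$$\forall m \in \mathbb{N}, \quad \mathbb{P}(A_m \mid \mathcal{F}_{T_k + md}) \geq p .$$

Lemma 5.2 implies that, for $m \in \mathbb{N}$,

$$\begin{aligned} \mathbb{P}(T_{k+1} > T_k + (m + 1)d) & \leq \mathbb{P}(A_0^c \cap A_1^c \cap \ldots \cap A_m^c) \\ & = \mathbb{E}\left( \mathbf{1}_{A_0^c \cap \ldots \cap A_{m-1}^c} \left( 1 - \mathbb{P}(A_m \mid \mathcal{F}_{T_k + md}) \right) \right) \\ & \leq (1 - p)\, \mathbb{P}(A_0^c \cap A_1^c \cap \ldots \cap A_{m-1}^c) , \end{aligned}$$

which inductively leads to $\mathbb{P}(T_{k+1} > T_k + (m + 1)d) \leq (1 - p)^{m+1}$. The conclusion follows by taking the limit $m \to \infty$ in the latter inequality.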

5.2.2 Proof of (37)

The proof of the second assertion relies on the following lemma (proved in Section 5.2.3). For $k \geq 1$, set

$$\mathcal{G}_k = \mathcal{F}_{T_k} , \qquad Y_k = \underline{\theta}_{T_k - 1} , \qquad (41)$$

where $\underline{\theta}_n$ is defined by (15).

Lemma 5.3. Let $v : (0, 1] \ni t \mapsto -\ln(t) \in \mathbb{R}_+$. Assume that A1, A2, A3a hold. Then, there exist $\bar{k} \in \mathbb{N}$ and $\bar{y} \in (0, 1)$ such that almost surely,

$$\forall k \geq \bar{k}, \quad Y_k \leq \bar{y} \implies \mathbb{E}\left( v(Y_{k+1}) \mid \mathcal{G}_k \right) \leq v(Y_k) .$$

We then define by induction stopping times $\sigma_m$ and $\tau_m$ as follows: $\sigma_0 = 0$, and for $m \geq 1$ (with the convention $\inf \emptyset = +\infty$),

$$\tau_m = \inf\{ k > \sigma_{m-1} : Y_k \leq \bar{y} \} , \qquad \sigma_m = \inf\{ k > \tau_m : Y_k > \bar{y} \} .$$

All possible events can then be classified using the following partition of the underlying probability space:

$$\{ \exists m \geq 0 : \sigma_m < +\infty = \tau_{m+1} \} \cup \{ \forall m \geq 1,\ \sigma_m < +\infty \} \cup \{ \exists m \geq 1 : \tau_m < +\infty = \sigma_m \} .$$

On the first two sets, $Y_k > \bar{y}$ infinitely often so that $\limsup_{k \to \infty} Y_k \geq \bar{y}$. To deal with the last set, one remarks that for each $m \geq 1$, the process $(v(Y_{k \wedge \sigma_m}) - v(Y_{k \wedge \tau_m}))_{k \geq \bar{k}}$ is a $\mathcal{G}_k$-supermartingale by Lemma 5.3 and is not smaller than $-v(Y_{\tau_m}) \geq -v(\bar{y}(1 - \gamma_1))$ by positivity and monotonicity of $v$ and definition of $\tau_m$. So this process converges almost surely to a finite limit $V_m$ as $k \to \infty$. As a consequence, on $\{ \exists m \geq 1 : \tau_m < +\infty = \sigma_m \}$, $(Y_k)_k$ converges a.s. to $\sum_{m \geq 1} \mathbf{1}_{\{\tau_m < +\infty = \sigma_m\}} Y_{\tau_m} e^{-V_m}$. In conclusion, $\mathbb{P}(\limsup_{k \to \infty} Y_k > 0) = 1$.

5.2.3 Proofs of some technical results 

We now provide the proofs of the previously quoted lemmas. 

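Proof of Lemma 5.2. By A1 and A2, the constant $c = \inf_{\mathsf{X}^2} q / \sup_{\mathsf{X}} \pi$ is positive. The main ingredient of the proof is the following lower bound: for all $i \in \{1, \ldots, d\}$ and $x \in \mathsf{X}$,

$$P_\theta(x, \mathsf{X}_i) \geq \int_{\mathsf{X}_i} q(x, y) \left( 1 \wedge \frac{\pi_\theta(y)}{\pi_\theta(x)} \right) \lambda(dy) \geq c\, \theta_*(i) \left( \frac{\theta(I(x))}{\theta(i)} \wedge 1 \right) . \qquad (42)$$

For $j \in \{1, \ldots, i_m - 1\}$, it holds on $\{ X_{T_k + md + 1} \in \mathsf{X}_{(i_m)_m}, \ldots, X_{T_k + md + j} \in \mathsf{X}_{(i_m + 1 - j)_m} \}$,

$$\frac{\theta_{T_k + md + j}((i_m + 1 - j)_m)}{\theta_{T_k + md + j}((i_m - j)_m)} = \frac{\theta_{T_k + md}((i_m + 1 - j)_m)}{\theta_{T_k + md}((i_m - j)_m)} \times \frac{1 + \gamma_{T_k + md + j} \left( 1 - \theta_{T_k + md + j - 1}((i_m + 1 - j)_m) \right)}{1 - \gamma_{T_k + md + j}\, \theta_{T_k + md + j - 1}((i_m + 1 - j)_m)} .$$

Both factors on the right-hand side are larger than 1 (the first one by definition of the ordered indices $(i)_m$), so that, by (42),

$$P_{\theta_{T_k + md + j}}\left( X_{T_k + md + j}, \mathsf{X}_{(i_m - j)_m} \right) \geq c\, \theta_*((i_m - j)_m) \geq c\, \underline{\theta}_* ,$$

where $\underline{\theta}_*$ is defined as in (15). Using successively the strong Markov property of the chain $(X_n, \theta_n)_n$, a backward induction on $n$, the definition of $i_m$, together with (42),

$$\begin{aligned} \mathbb{P}(A_m \mid \mathcal{F}_{T_k + md}) & = \mathbb{E}\left( \mathbf{1}_{\{ X_{T_k + md + 1} \in \mathsf{X}_{(i_m)_m}, \ldots, X_{T_k + md + i_m - 1} \in \mathsf{X}_{(2)_m} \}}\, P_{\theta_{T_k + md + i_m - 1}}\left( X_{T_k + md + i_m - 1}, \mathsf{X}_{(1)_m} \right) \,\middle|\, \mathcal{F}_{T_k + md} \right) \\ & \geq c\, \underline{\theta}_*\, \mathbb{E}\left( \mathbf{1}_{\{ X_{T_k + md + 1} \in \mathsf{X}_{(i_m)_m}, \ldots, X_{T_k + md + i_m - 2} \in \mathsf{X}_{(3)_m} \}}\, P_{\theta_{T_k + md + i_m - 2}}\left( X_{T_k + md + i_m - 2}, \mathsf{X}_{(2)_m} \right) \,\middle|\, \mathcal{F}_{T_k + md} \right) \\ & \geq (c\, \underline{\theta}_*)^{i_m - 1}\, P_{\theta_{T_k + md}}\left( X_{T_k + md}, \mathsf{X}_{(i_m)_m} \right) \\ & \geq \frac{1 - \gamma_1}{1 + \gamma_1}\, (c\, \underline{\theta}_*)^{i_m} \geq \frac{1 - \gamma_1}{1 + \gamma_1}\, (c\, \underline{\theta}_*)^{d} . \end{aligned}$$

The proof is therefore concluded by setting $p = \frac{1 - \gamma_1}{1 + \gamma_1} (c\, \underline{\theta}_*)^d$.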

Proof of Lemma \ 5. 3\ Note first that, apart from the case when X n E X/ n , the other situation 
ensuring that 9_ n+l > 9 n is the case when the chain visits the stratum of smallest weight, but 
the weight of this stratum is then increased while the weights of the other ones are decreased, 
in such a manner that this stratum no longer remains the one with smallest weight. In 
mathematical terms, X n+ \ E Xj n and 

On <min , n(j) <^(l + 7n +1 (l-0 



1 - 7n+l# n j^In 1 - Jn+lin 

where the first inequality actually implies that Q_ n+ i > 9 n and the second one that X n+ \ ^ 
x /„+i- 

Define = inf{m > 1 : uj E A m ^i}. Then, recalling that Y k = 9 T 
Y k+l >Y k {l + lTk (l-Y k )) (1 

n=T k 
T k +ud-2 

>Y k (l+ lTk (l-Y k )) J] (l-lT k 0n(HXn+l))) (43) 

n=T k 

where, as discussed before Theorem 13. 21 the first factor comes from the definition of T k (see 
(|38p ): the first inequality from the possibility that for some n E {T k , . . . , T k+ \ — 2}, X n+ \ E Xj n 
and 7 n +i 7^ I n \ and the second inequality from Al3a1and the fact that T k+ \ <T k + fj,d by (|40p . 
By Lemma 15.21 fi is smaller than some geometric random variable with parameter bounded 
from below by the positive constant p. Therefore, the number of terms smaller than 1 in the 
product on the right-hand side of (j4"3|) is small. If Y k is chosen small enough, then 7r fe (1 — Y k ) is 
much larger than 7T fe ^fc an d if 9 n {I{X n+ i)) remains of the same order as 9T k ~i(I(Xx k )) = Y k 
for n E {T k , . . . , T k + fid — 2}, then, in average, Y k+ \ will be larger than Y k . Unfortunately, 
since for 9 E 0, maxi<j<d 9(i) > g, #T fc W is large for z in some subset of and 
we need to control the probability for (X n )x k +i<n<T k +fid-i to visit the corresponding strata. 
Under ACQ C d = < +00 and, for all i ^ j E {1, . . . , d} and x E X,, 

P.tx.X,) = ,(*,,) (l A «) A( dri < C (M A x) . (44 ) 

This ensures that the conditional probability to choose X n+ \ E Xj with large weight 9 n (j), 
given X n E Xj with low weight 9 n (i), is small. 



To quantify this intuition, define zy fe E argmax 1<i<rf _ 1 — ^ — 7TT~\ — • Since 

0T fc (Wo) 



n 

t=i 

it holds 



_^((* + l)oj 

%((* + l)o) _ maxj % (i) 1 
% ((i)o) " ^(1 + 7T fc (1 - n)) " 2dY k 



h f7r \ l \ 0) * VdY k yW-V 

d T k {{iT k )0) 
so that, for all i E {1, . . . , 



%.(«o) < 0T k ((i Tk )o) = , °]i {lTk }t x x %((z Tfe + 1) ) < (2dy fc ) 1 /(^D . 



Hereafter, the set 



X(fc) = U^+iXfl,, , (45) 



plays the role of the union of strata with large weight according to 0T k ■ Define, for m 6 N, 

(/i i \ (m+l)d\ 



(46) 



Then, 



max,<j„ ^Tt+n((i)o) 

Vm G N, Vn < (m + l)d, . fc * ttt^ A 1 < p m , (47) 

minj>j Tfc+ i e Tk+n ((i) ) 

which implies 

VjG{l,...,i Tfc },n<(m + l)d, %+n((j')o) < Pm • (48) 
Using (|43p . the definition of /i, then the inequality — ln(x) < — — 1, it follows 

E(v(Y k+1 )\G k ) - v(Y k ) + ln(l + 7Tfc (l - Y k )) 

(/T k +(m+l)d-l 
In J] (1 - lTk 6 n {I{X n+1 ))) ] l { Agn...nA^„ 1 nA m }|a fc 

\ n=T k 
I (T k +{m+\)d-\ 

<"E E h II - 7T fc 0n(/(X n+1 ))) ] l {A gn...nA^ l} l^ 

mGN V V "=T fc 
m 

where the numbers E m i are defined by decomposing the possible events using the partition 

Bo, Bq n B\ , . . . , Bq n • • • n B^_ 2 n -B m _i, 5g n ■ ■ ■ n s^-i : 

^(((1 - 7T fe )" (m+1)d - l) l { A S n...n^_ 1 n Bo }l^) fol ' * = 0- 

^(((1 - 7T fe )" (m+1)d - l) l { Agn...n^_ 1 nBgn...nBf_ 1 nB ! }l^) for < Z < m, 

EM (1 - 7T kPm r {m+1)d ~ lj l {j 4gn...n^_ 1 nBgn...n^_ 1 }l^) for i = m. 

The inequality (48) was used for the case $l = m$.
Let 

_def ( H4d 2 Y k ) 

2d(d-l)ln(^ 



-Emi — < 
Note that $\bar m$ is chosen so that $\rho_m \le Y_k^{1/(2(d-1))}$ for $0 \le m < \bar m$. Besides, we may assume that
$k$ is large enough so that $1 - \rho < (1 - \gamma_{T_k})^d$, $\rho$ being defined in Lemma 5.2 (indeed, $T_k \ge k$,
and $\lim_{n \to \infty} \gamma_n = 0$ by A3). Therefore, using Lemma 5.2 for the second inequality,



E E ™™ < E ((i - iT k Yl /2{d - l) r {m+l)d - 1) viM n • • • n A^\g k ) 
+ E (! - 7T fe )" (m+1)d P(^ n . . . n A c m ^\g k ) 

m>fn 

< E(( 1 -7T fc n 1/2(d " 1) r (m+1)d -i)a-pr+ Ea-7T fc )- (m+1)<i (i-p) m 

l-(l-7T fc n 1/2(d - 1} ) rf , 1 ( l-p V KJ+1 . (50) 



p(p + (i _ 7Tfe yV2(d-i) )d _ 1} p + (i _ 7T J^ _ i ^(1 - 7r J 

The terms $E_{m,l}$ for $l < m$ can be dealt with using the following lemma, the proof of which
is postponed to the end of this section.

Lemma 5.4. Let A m ,B m and Q k be given by \39\) , \4(fy and ( f^-/[ ). For < / < m, one has 

F(A C n . . . n A c m _ x n B c n . . . n Bf_ t n B^) < Cd Pl (l - p)™- 1 . 

This lemma ensures that for I < m, E ml < Cd((l - -y Tk )~ {m+1)d ~ 1)(1 - p) m ^V By 
Fubini's theorem (the terms of the sums below are non-negative) and a reasoning similar to
the one used above to estimate the sum $\sum_{m \in \mathbb{N}} E_{m,m}$, we obtain



t^EE^E^ E ^-iT k r^ d -i)(i-v) m - 1 



m£N 1=0 ZeN m>l+l 

'(i-TT t r i ((i--iT t r d (i-p))' (1-pY 



p+(l-7T fc ) d -l p 



<1^> r 1 , - X) + — 1 — 



\ (p+ (1 - 7 rJ d - I) 2 fV " (p+ (1 - 7T fc ) d - I) 2 V(l " 7T fc ) d 



(51) 



Since T k > k, and limn-yoo 7 n = by A l3fel there exists a deterministic constant k such that 
r k ) d > 1 - | and ln(l + 7Tfc (1 - y fc )) > 

/l+TTfcA < 2 7Tfc 
Vl"7T fc y - l~7T fe 



for fc > k, (l-7r fe ) d > 1-| andln(l + 7 Tfc (l-Yfc)) > 2^.(1 _y fc ). In view of ggj-PJ-J5TJ, 
the definition of m and In I , — < , — there exists a finite constant IT such that, for 



k > k and Y k < I /Ad 



■2 



E{v(Y k+l )\g k ) - v(Y k ) < -2^(1 - y fc ) 



i, a( *-i) ^ / (l-lT k )ln(^j- 2 )H^Y k ) 



+ K \ lTk Y,' ^+exp 



4d(d - l)7 Tfe 

This implies that there exists $y \in (0, 1/(4d^2)]$ such that, for all $k \ge \bar k$,

\[ Y_k \le y \;\Longrightarrow\; \mathbb{E}(v(Y_{k+1}) \mid \mathcal{G}_k) \le v(Y_k), \]

which concludes the proof of Lemma 5.3.

Proof of Lemma 5.4. Let us first consider the case when $l \ge 1$. By Lemma 5.2,



F(A C n . . . n A^-i n B c n . . . n Bf_ x n B t \g k ) 

< (l - p)™- 1 '^ (i { ^ n ... nAf _ i} i {Bf _ i} P(^|J- Tfe+ w)| a fc 

To conclude, it is therefore enough to check that 1{_b ; c |^T fe +/d) < Cdpi. Now, 

l {fl c_ i} P(B,|Jr fc +w) < l { x aib+I ^x(fc)}P^ +w (^+w,X(fe)) 

+ nx Tk+ id+i i X(fc), . . . , x Tfc+M+n £ X(fc),x Tfe+w+n+1 g X(fc)|^ 



71=1 

d-1 

< YE(t {XTk+ld+nmk)} P eTk+ld+n (X Tk+ i d+n ,X(k))\T Tk +id), 



n=0 

where 1{X T +H+n ^X(fc)}-P6» Tfc+H+n (^T fe +/d+n) X(ife)) < by (01| and flU). 
For the case / = 0, we use again Lemma [5.21 to obtain 

p(A c n . . . n A c m _ x n B Q \g k ) < (1 - pJ^P^ol^)- 

The second factor is still bounded from above by Cdpi since Xx k which gives the 

claimed result. 

5.3 Proof of Theorem 3.3

We start by proving that the function $V$ defined in (17) is a Lyapunov function for the mean-field $h$ given by (12).

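Proposition 5.5. Under A1,

a) $V$ is non-negative and continuously differentiable on $\Theta$.
b) $h$ is continuous on $\Theta$ and given by

\[ h(\theta) = \Big( \sum_{j=1}^d \frac{\theta_\star(j)}{\theta(j)} \Big)^{-1} (\theta_\star - \theta). \qquad (52) \]

c) for any $M > 0$, $\{\theta \in \Theta,\ V(\theta) \le M\}$ is a compact subset of $\Theta$.
d) for any $\theta \in \Theta$, $\langle \nabla V(\theta), h(\theta) \rangle \le 0$. In addition, $\{\theta,\ \langle \nabla V(\theta), h(\theta) \rangle = 0\} = \{\theta_\star\}$.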
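Proof. (a) It is trivial to check that $V$ is $C^1$ on $\Theta$. By Jensen's inequality,

\[ V(\theta) = -\sum_{i=1}^d \theta_\star(i) \log\Big( \frac{\theta(i)}{\theta_\star(i)} \Big) \ge -\log\Big( \sum_{i=1}^d \theta(i) \Big) = 0. \]

(b) For any $i \in \{1, \ldots, d\}$, we have by the definition of $H$ and (12),

\[ h_i(\theta) = \int H_i(x,\theta)\, \pi_\theta(x)\, \lambda(dx) = \theta(i) \int_{\mathsf{X}_i} \pi_\theta(x)\, \lambda(dx) - \theta(i) \sum_{k=1}^d \theta(k) \int_{\mathsf{X}_k} \pi_\theta(x)\, \lambda(dx). \]

The property (52) now follows upon noting that, by definition of $\pi_\theta$ (see (6)),

\[ \int_{\mathsf{X}_k} \pi_\theta(x)\, \lambda(dx) = \Big( \sum_{i=1}^d \frac{\theta_\star(i)}{\theta(i)} \Big)^{-1} \frac{\theta_\star(k)}{\theta(k)}. \]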



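(c) Set $M' = M - \sum_{i=1}^d \theta_\star(i) \log \theta_\star(i)$. Observe that, by A1, $M' > M > 0$. By definition of
$V$ (see (17)),

\[ \{\theta \in \Theta,\ V(\theta) \le M\} = \Big\{ \theta \in \Theta,\ -\sum_{i=1}^d \theta_\star(i) \log \theta(i) \le M' \Big\} \subset \bigcap_{j=1}^d \{\theta \in \Theta,\ \theta(j) \ge m\}, \]

with $m \stackrel{\mathrm{def}}{=} \exp(-M'/\inf_k \theta_\star(k))$. Therefore, for any $M > 0$, there exists $m > 0$ such that

\[ \{\theta \in \Theta,\ V(\theta) \le M\} \subset \Big\{ \theta \in \Theta,\ m \le \inf_i \theta(i) \le \sup_i \theta(i) \le 1 \Big\}. \]

Since $V$ is continuous, $\{V \le M\}$ is a compact subset of $\Theta$.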

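(d) By definition of $V$ and $h$ (see (17) and (52)), a simple computation shows that

\[ \langle \nabla V(\theta), h(\theta) \rangle = -\Big( \sum_{j=1}^d \frac{\theta_\star(j)}{\theta(j)} \Big)^{-1} \sum_{i=1}^d \frac{\theta_\star(i)}{\theta(i)} (\theta_\star(i) - \theta(i)) = -\Big( \sum_{j=1}^d \frac{\theta_\star(j)}{\theta(j)} \Big)^{-1} \sum_{i=1}^d \frac{(\theta_\star(i) - \theta(i))^2}{\theta(i)}, \]

where we have used $\sum_{i=1}^d (\theta_\star(i) - \theta(i)) = 0$ to obtain the second equality. It is also clear from
the above expression that the scalar product is null if and only if $\theta = \theta_\star$. □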
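
The sign of the scalar product in item d) can be checked numerically. The sketch below is only an illustration: it assumes the forms $V(\theta) = -\sum_i \theta_\star(i) \log(\theta(i)/\theta_\star(i))$ and $h(\theta) = (\sum_j \theta_\star(j)/\theta(j))^{-1}(\theta_\star - \theta)$ read off from this section, and the function names and the test vectors are hypothetical.

```python
import math

def mean_field(theta, theta_star):
    # h(theta) = (sum_j theta_star(j)/theta(j))^{-1} (theta_star - theta)  (assumed form of (52))
    s = sum(ts / t for ts, t in zip(theta_star, theta))
    return [(ts - t) / s for ts, t in zip(theta_star, theta)]

def lyapunov(theta, theta_star):
    # V(theta) = -sum_i theta_star(i) log(theta(i)/theta_star(i))  (assumed form of (17))
    return -sum(ts * math.log(t / ts) for ts, t in zip(theta_star, theta))

def grad_lyapunov(theta, theta_star):
    # Partial derivative of V with respect to theta(i) is -theta_star(i)/theta(i)
    return [-ts / t for ts, t in zip(theta_star, theta)]

def drift(theta, theta_star):
    # Scalar product <grad V(theta), h(theta)>
    g = grad_lyapunov(theta, theta_star)
    h = mean_field(theta, theta_star)
    return sum(gi * hi for gi, hi in zip(g, h))

# Hypothetical probability vectors with d = 3 strata
theta_star = [0.5, 0.3, 0.2]
theta = [0.1, 0.2, 0.7]
print(lyapunov(theta, theta_star), drift(theta, theta_star))
```

Under these assumptions the drift is strictly negative away from $\theta_\star$ and vanishes at $\theta_\star$, which is the behavior item d) describes.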

We now wish to prove that the increment $\gamma_{n+1} (H(X_{n+1},\theta_n) - h(\theta_n))$ in (16) vanishes in
an appropriate sense. To this end, we need some preliminary results and we rewrite the update
of the weights as

\[ \frac{\theta_{n+1}(i)}{\theta_n(i)} = 1 + \gamma_{n+1} Y_{n+1}(i), \qquad (53) \]

where $Y_{n+1}(i) \stackrel{\mathrm{def}}{=} \mathbf{1}_{\mathsf{X}_i}(X_{n+1}) - \theta_n(I(X_{n+1}))$ satisfies $|Y_{n+1}(i)| \le 1$. This key formula says that
the difference $\theta_{n+1}(i) - \theta_n(i)$ is not simply of the order of the step-size $\gamma_{n+1}$ but of order $\theta_n(i)\gamma_{n+1}$,
which makes it possible to circumvent the explosive behavior of the various estimates obtained in the
next lemmas as $\min_{1 \le i \le d} \theta(i)$ tends to 0.
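
A minimal sketch of one step of this multiplicative update may help fix ideas. It assumes the linearized form $\theta_{n+1}(i) = \theta_n(i)(1 + \gamma_{n+1} Y_{n+1}(i))$ with $Y_{n+1}(i) = \mathbf{1}_{\mathsf{X}_i}(X_{n+1}) - \theta_n(I(X_{n+1}))$; the function name `update_weights` and the numerical values are hypothetical.

```python
def update_weights(theta, stratum, gamma):
    """One step theta_{n+1}(i) = theta_n(i) * (1 + gamma * Y(i)),
    where Y(i) = 1{i == stratum} - theta_n(stratum) and `stratum`
    is the index of the stratum containing the new sample."""
    visited_weight = theta[stratum]
    return [
        t * (1.0 + gamma * ((1.0 if i == stratum else 0.0) - visited_weight))
        for i, t in enumerate(theta)
    ]

# Hypothetical weights: the first stratum has a very small weight
theta = [1e-6, 0.4, 0.6 - 1e-6]
gamma = 0.1
new_theta = update_weights(theta, stratum=1, gamma=gamma)
increments = [abs(n - t) for n, t in zip(new_theta, theta)]
print(new_theta, increments)
```

Each increment is bounded by $\gamma\,\theta_n(i)$, so a tiny weight moves by a tiny amount, and the weights still sum to one after the update.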



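Lemma 5.6. For any $\theta, \theta' \in \Theta$,

\[ \|\pi_\theta\, d\lambda - \pi_{\theta'}\, d\lambda\|_{TV} \le 2(d-1) \sum_{i=1}^d \Big| 1 - \frac{\theta'(i)}{\theta(i)} \Big|. \]

Proof. By definition of $\pi_\theta$ (see (6)),

\[ \pi_\theta(x) = \Big( \sum_{j=1}^d \frac{\theta_\star(j)}{\theta(j)} \Big)^{-1} \sum_{i=1}^d \frac{\pi(x)}{\theta(i)} \mathbf{1}_{\mathsf{X}_i}(x). \]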
Hence,



\\lTgd\ - 7T /dA||TV < 



We denote by N(9, 9') the numerator of the expression of the right-hand side of the previous 
inequality. Then, 



\9>(i)e(j)-6(i)8>m 

9(i)e>(i)e(j)e>(j) 



m - m 



9(i)9(j)9'(j) 



i =1 i+3 



j=i tyj 

For the denominator, we use the lower bound 

d d 
Vi,je{l,...,d}, J>(AO/0(*)] $>*(O/0'(O] 



> 



e{i)9>{z)9{j) 



fc=i 



z=i 



Therefore, 



||7r fl dA - 7r fl /dA||TV < 2 ^2Y1 

which gives the claimed result. 



e{j)-m 



0{j) 



5=1 
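
The bound of Lemma 5.6 can be checked numerically on a few weight vectors. This is a sketch under stated assumptions: the stratum masses of $\pi_\theta$ are taken proportional to $\theta_\star(i)/\theta(i)$, the total variation norm is taken as the sum of absolute differences of these masses, and the bound is read as $2(d-1)\sum_i |1 - \theta'(i)/\theta(i)|$. All names and test values are hypothetical.

```python
def stratum_masses(theta, theta_star):
    # Mass of stratum i under pi_theta, proportional to theta_star(i)/theta(i)
    weights = [ts / t for ts, t in zip(theta_star, theta)]
    total = sum(weights)
    return [w / total for w in weights]

def tv_norm(theta, theta_prime, theta_star):
    # Sum of |p_i - q_i|; this equals the integral of |pi_theta - pi_theta'|
    # because both densities are proportional to pi on each stratum.
    p = stratum_masses(theta, theta_star)
    q = stratum_masses(theta_prime, theta_star)
    return sum(abs(a - b) for a, b in zip(p, q))

def lemma_5_6_bound(theta, theta_prime):
    d = len(theta)
    return 2 * (d - 1) * sum(abs(1 - tp / t) for t, tp in zip(theta, theta_prime))

theta_star = [0.2, 0.3, 0.5]
pairs = [
    ([0.1, 0.2, 0.7], [0.12, 0.18, 0.70]),
    ([0.01, 0.49, 0.5], [0.02, 0.48, 0.5]),
    ([1 / 3, 1 / 3, 1 / 3], [0.3, 0.3, 0.4]),
]
for theta, theta_prime in pairs:
    print(tv_norm(theta, theta_prime, theta_star), lemma_5_6_bound(theta, theta_prime))
```

The right-hand side only involves relative changes of the weights, which is what makes it usable together with (53).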



Lemma 5.7. For any 9,9' € Q and any x G X suc/i t/iat irg(x) < ttqi[x), 



\Pg(x,-) -Pg,(x,-)\\ TV < 2 2 sup 

V »€{!,...,<*} 



^(0 



+ sup 
i6{l,...,d} 



Proof. For any x € Xj and y £ X^, we have by definition of 7rg (see ([6])) 



7r fl (x)7r fl /(y) 9(k) 9\j) 



Mx) 9'(j) 



Tr (y)ir e >(x) 9(j)9>(k) ' tt ,(x) 9(j) ' 
Since Pg is a Metropolis kernel, for any bounded measurable function /, 



\Pef(x) - P e >f(x) 



q(x, y) (a e {x, y) - ag>(x, y)) (f(y) - f(x)) \{dy) 



< 2 sup |/| sup | ag — Otgi I , 
X X 2 



with ctg(x,y) = 1 A (irg(y)/iro(x)). Let us distinguish all the cases: 
• Kg(y) < 7rg(x) and -Kg,{y) < ixg>(x). Then, 



\ag(x,y) - ag>(x,y)\ 



< 







TTg(x) 


no>(x) 


We(y) 


-ng>{y)\ 



< - ir e >(y)\ \irg(x) - irg,(x)\ 



TTg(x) 



TTg(x) 



irg(y) 



+ 



TTg(x) 



< 2 sup 

X 



Kg 



2 sup 

ie{i,...,d} 



9(i) 



□ 



(54) 
where we used (54) in the last equality.

• 7Te(y) < 7re(^) and TTgi(x) < 7Tg>(y). Since 7r#(x) < 7r#/(x) < irg>(y), it holds 



- a e /(x,y)| = 1 — < 1 < sup 

7T (x) 7T0'(yJ X 



1-2- 



sup 

ie{i,...,4 



1 



*(<) 



7Te(x) < 7r e (y) and tt 6 >(x) < 7r e /(y). Then, |a 9 (x,y) - a 9 /(x,y)| = 0. 
fte(x) < 7re(y) and irg>{y) < n$'(x). Then, using again (p3 



, / \ , \ I . 7Te'(y) / -, 7r e /(y) vr e (x) vr e /(x) - vr e (x) vr e (x) vr e (y) - n e >(y) 

\ag{x,y) - a e >{x,y)\ = 1 < 1 — = -— 1 -— 

iTe'{x) 7Te'{x) 7re{y) vr e /(x) ng>{x) 7rg(y) 



< sup 
ie{i,...,d} 



1 



r(i) 



9(i) 



+ sup 
i£{l,...,d} 



1 



This concludes the proof. □

As a corollary of Lemmas 5.6 and 5.7, we obtain the following result.



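Corollary 5.8. Under A1 and A3a, there exist a constant $C$ and $N \in \mathbb{N}$ such that, for any
$n \ge 0$,

\[ \|\pi_{\theta_n} d\lambda - \pi_{\theta_{n+1}} d\lambda\|_{TV} \le 2d(d-1)\, \gamma_{n+1}, \qquad (55) \]

and, for any $n \ge N$,

\[ \sup_x \|P_{\theta_n}(x,\cdot) - P_{\theta_{n+1}}(x,\cdot)\|_{TV} \le C\, \gamma_{n+1}. \]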



Proof. The inequality (55) immediately follows from Lemma 5.6, (53) and the upper bound
$|Y_{n+1}(i)| \le 1$ (see (53)). In addition, by Lemma 5.7,



\P 9n+1 (x,-)-Pe n (x,-)\\TV <4 sup 



?n(0 



+ sup 



< 47 n+ i sup |y n +i(i)| + sup ■ 



0n+l(i) 
\Yn+l(l, 



1 + j n+ iY n+1 (i)\ J 

The proof is concluded upon noting that $|Y_{n+1}(i)| \le 1$ and $1 - \gamma_{n+1} \ge 1/2$ (for instance) for
$n$ sufficiently large. □

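Lemma 5.9. Assume A1 and A2. Then, for any $\theta \in \Theta$, there exists a function $\hat H_\theta$ solving
the Poisson equation $\hat H_\theta - P_\theta \hat H_\theta = H(\cdot,\theta) - \pi_\theta(H(\cdot,\theta)) = H(\cdot,\theta) - h(\theta)$. In addition,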



sup 

0ee,a;€X 



Hg(x] 



< oo , 



and there exists a constant C such that for any 9,9' £ 0, 



sup 
x 



Hg — Hg> 



+ 



PeHg — Pg'Hgi 



< C 



mi {9(i)A9'(i)} 

iG{l,...,d\ 
Proof. Since $\sup_{\theta \in \Theta} \sup_{x \in \mathsf{X}} |H(x,\theta)| \le 1$, the results of Proposition 3.1 show that $\hat H_\theta$ exists
for any $\theta \in \Theta$ and (see e.g. [26, Section 17.4.1])



sup sup 

6»eOxGX 



Hg(x) <snpsnpJ2\ P eH(;e)( X )-TTg(H(;e))\ < 



(56) 



n>0 



In addition, in view of Proposition 3.1 and [11, Lemma 4.2], there exists a constant $C$ such
that, for any $\theta, \theta' \in \Theta$,



sup 
x 



PnHg — PqiHoi 



sup 
X 



Ho — Hai 



< C (sup \H(; 6) - H(; 6')\ + sup \\P g (x, •) - Pg>(x, -)||tv + IMA - Trg,d\\\ T v) 
\ x zex / 

By definition of $H$, there exists a constant $C'$ such that for any $\theta, \theta' \in \Theta$,

\[ \sup_x |H(x,\theta) - H(x,\theta')| \le C' |\theta - \theta'|. \]

The proof is then concluded by Lemmas 5.6 and 5.7. □

Proposition 5.10. Assume A1, A2 and A3. Then, almost surely,

\[ \limsup_{k \to +\infty}\ \sup_{l \ge k} \Big| \sum_{n=k}^{l} \gamma_{n+1} \big( H(X_{n+1},\theta_n) - h(\theta_n) \big) \Big| = 0. \]



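Proof. We decompose the increment into a martingale term and two remainders, using the
function $\hat H_\theta$ defined in Lemma 5.9:

\[ H(X_{n+1},\theta_n) - h(\theta_n) = \hat H_{\theta_n}(X_{n+1}) - P_{\theta_n} \hat H_{\theta_n}(X_{n+1}) = M_{n+1} + R^{(1)}_{n+1} + R^{(2)}_{n+1}, \]

with

\[ M_{n+1} = \hat H_{\theta_n}(X_{n+1}) - P_{\theta_n} \hat H_{\theta_n}(X_n), \]
\[ R^{(1)}_{n+1} = P_{\theta_n} \hat H_{\theta_n}(X_n) - P_{\theta_{n+1}} \hat H_{\theta_{n+1}}(X_{n+1}), \]
\[ R^{(2)}_{n+1} = P_{\theta_{n+1}} \hat H_{\theta_{n+1}}(X_{n+1}) - P_{\theta_n} \hat H_{\theta_n}(X_{n+1}). \]

Observe that $(M_n)_{n \ge 1}$ is a martingale increment such that $\sum_n \gamma_n^2\, \mathbb{E}[|M_n|^2] < \infty$ by A3b and
(56). Hence (see e.g. [12, Corollary 2.2]) $\limsup_k \sup_{l \ge k} \big| \sum_{n=k}^{l} \gamma_{n+1} M_{n+1} \big| = 0$ almost surely.

Consider now $\sum_{n=k}^{l} \gamma_{n+1} R^{(1)}_{n+1}$. Note that $R^{(1)}_{n+1}$ is the increment of a telescoping sum. We therefore resort to
Abel's transform, and obtain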

I i 

J2^nR { nl 1 =7kPe k Hg k (X k )- 7e Pg e+1 Hg e+1 (X e+1 )+ ^ ( ln - 7n ^) Pg n Hg n (X n ) . 
n=k ro=fc+l 
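
The Abel transform (summation by parts) invoked here can be illustrated on arbitrary sequences. In the sketch below, `a[n]` stands for $P_{\theta_n} \hat H_{\theta_n}(X_n)$ and `gamma[n]` for the step-size; the indexing convention $\sum_{n=k}^{l} \gamma_n (a_n - a_{n+1})$ and the test sequences are assumptions made for the illustration only.

```python
import math

def abel_lhs(gamma, a, k, l):
    # sum_{n=k}^{l} gamma_n (a_n - a_{n+1})
    return sum(gamma[n] * (a[n] - a[n + 1]) for n in range(k, l + 1))

def abel_rhs(gamma, a, k, l):
    # gamma_k a_k - gamma_l a_{l+1} + sum_{n=k+1}^{l} (gamma_n - gamma_{n-1}) a_n
    boundary = gamma[k] * a[k] - gamma[l] * a[l + 1]
    interior = sum((gamma[n] - gamma[n - 1]) * a[n] for n in range(k + 1, l + 1))
    return boundary + interior

# Hypothetical non-increasing step-sizes and a sequence bounded by 1
gamma = [1.0 / (n + 1) ** 0.7 for n in range(50)]
a = [math.sin(n) for n in range(51)]
k, l = 3, 40
print(abel_lhs(gamma, a, k, l), abel_rhs(gamma, a, k, l))
```

When the step-sizes are non-increasing and $|a_n| \le C$, the right-hand side is at most $C(\gamma_k + \gamma_l + \gamma_k - \gamma_l) = 2C\gamma_k$ in absolute value, which is the type of estimate used next.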

In view of (56) and A3, there exists a constant $C$ such that for any $l \ge k$,

\[ \Big| \sum_{j=k}^{l} \gamma_{j+1} R^{(1)}_{j+1} \Big| \le C \Big( \sup_{j \ge k} \gamma_j + \sum_{j=k+1}^{l} |\gamma_j - \gamma_{j-1}| \Big) \le 2C \gamma_k \]

almost surely. Assumption A3a then implies that $\limsup_k \sup_{l \ge k} \big| \sum_{n=k}^{l} \gamma_{n+1} R^{(1)}_{n+1} \big| = 0$ almost surely.

We finally turn to $\sum_{n=k}^{l} \gamma_{n+1} R^{(2)}_{n+1}$. Lemma 5.9 combined with assumption A2 imply, after
manipulations similar to the ones used in the proof of Corollary 5.8, that there exists a constant
$C''$ such that for any $j \ge 0$,

\[ \sup_x \big| P_{\theta_{j+1}} \hat H_{\theta_{j+1}}(x) - P_{\theta_j} \hat H_{\theta_j}(x) \big| \le C'' \gamma_{j+1}. \]

Then, by assumption A3, $\sum_n \gamma_n |R^{(2)}_n|$ exists almost surely, which implies that

\[ \limsup_k\ \sup_{l \ge k} \Big| \sum_{n=k}^{l} \gamma_{n+1} R^{(2)}_{n+1} \Big| = 0. \]

This gives the claimed result. □



The proof of Theorem 13.31 is now concluded by resorting to [H Theorems 2.2 and 2.3.]. 
Theorem 3.2 and Propositions 5.5 and 5.10 prove that the assumptions of these theorems hold.



5.4 Proof of Theorem 3.4

Proof of (18). The proof is based on [11, Theorem 2.1]. We check successively the assumptions
required to apply this result. First, the condition A1 of [11] holds since $\pi_\theta P_\theta = \pi_\theta$ by
assumption A2.

We now turn to condition A2 in [11]. Fix $\varepsilon > 0$. By Proposition 3.1,



E 



dX\ 



TV 



< e 



by choosing $r_\varepsilon \ge \ln(\varepsilon/2)/\ln(1-\rho)$. The constant sequence $r_\varepsilon(n) = r_\varepsilon$ is non-increasing and
obviously satisfies $r_\varepsilon(n)/n \to 0$. Furthermore, by Corollary 5.8, there exists a constant $C$
(independent of $\varepsilon$) such that



sup || P 9 n _ r .Ax,-) - Pe n _ r Ax,-)\\ T y 



r £ -l j-l 
3=1 £=0 



sup||P ( j tt _ re+tfl (a; ) -) - Pe n 



t M\\ 



TV 



r e -l j-l 
< C ^ ln-re+t+1 

j=i e=o 



n— »+oo 



since the last sum is composed of a finite number of terms, each of them going to 0 in view of
assumption A3. This gives that condition A2 in [11] holds.

Finally, Theorem 3.3 and Lemma 5.6 imply that $\lim_n \int f(x)\, \pi_{\theta_n}(x)\, \lambda(dx) = \int f(x)\, \pi_{\theta_\star}(x)\, \lambda(dx)$
almost surely.
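
The choice of $r_\varepsilon$ in the verification of condition A2 can be made concrete. The sketch below assumes a uniform ergodicity bound of the form $2(1-\rho)^r$ for the distance to stationarity after $r$ steps, which is consistent with the threshold $\ln(\varepsilon/2)/\ln(1-\rho)$ but is not legible in the display above; the function name and the numerical values are hypothetical.

```python
import math

def burn_in_length(eps, rho):
    """Smallest integer r with r >= ln(eps/2) / ln(1 - rho),
    so that 2 * (1 - rho)**r <= eps (assumed form of the ergodicity bound)."""
    return math.ceil(math.log(eps / 2.0) / math.log(1.0 - rho))

eps, rho = 1e-3, 0.1   # hypothetical tolerance and minorization constant
r = burn_in_length(eps, rho)
print(r, 2.0 * (1.0 - rho) ** r)
```

Since $r_\varepsilon$ does not depend on $n$, the double sum of step-sizes appearing in the argument has a fixed number of terms for each $\varepsilon$.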



Proof of (19). We check the conditions of [11, Theorem 2.7]. First, the condition A3 of [11]
holds with $V = 1$ (with the notation of [11]) in view of Proposition 3.1. Observe indeed that
since $V = 1$, $P_\theta V(x) = 1 = c + (1 - c)$ for any $c \in (0,1)$, thus showing the drift inequality. In
addition, by Proposition 3.1, $P_\theta(x,A) \ge \rho \int_A \pi_\theta(x)\, \lambda(dx)$ for any $x \in \mathsf{X}$, $A \in \mathcal{X}$: this implies
(i) the minorization condition on the kernel $P_\theta$, (ii) $\pi_\theta\, d\lambda$ is an irreducible measure and $P_\theta$
is psi-irreducible, and (iii) $P_\theta$ is strongly aperiodic since $\mathsf{X}$ is small for $P_\theta$ (see [26, Section
5.4.3]).

In addition, by Corollary 5.8, there exists a constant $C$ such that



Ejg sup||P e 



k>l 



cEf^ E ( 7?+ Ej< 

k>\ k>l v 7 



OC 



by A3b. This shows that the condition A4 of [11] holds. Finally, the condition A5 of [11] is
trivially satisfied in the case under consideration (since $V = 1$ with the notation of [11]).

5.5 Proof of Theorem 3.5

Proof of (20). We write



E 



E*»(*) f(X n ) l Xi {X n ) 



i=l 



E 



E{*»(O-0*«} /(x n ) i Xi (x n ) 



,i=l 



+ ]T^(*)E[/(X n ) l Xi (X n )] 



Theorem 3.3 and the dominated convergence theorem imply that the first term in the right-hand
side converges to zero. By Theorem 3.4, the second term converges to



* A\ 1 



firdX. 



Proof of (21). We write

d n d n 

2n(/)=E (*»(0 -**(*)) E +E E 



We have 



i=l 



fc=i 



i=l 



fe=l 



E 

i=l 



^(*)) e/mi^k 



fe=i 



<su P |/| E |0n(O-^*(i 



i=l 



and the right-hand side converges to zero almost surely by Theorem 3.3. In addition, by
Theorem [ 



lj2f(X k )t Xi (X k )^ [ f« e J\=^—f fn 



This concludes the proof. 



5.6 Proof of Theorem 3.6

We write $H(X_{n+1},\theta_n) = h(\theta_n) + e_{n+1} + r_{n+1}$ with

\[ e_{n+1} \stackrel{\mathrm{def}}{=} \hat H_{\theta_n}(X_{n+1}) - P_{\theta_n} \hat H_{\theta_n}(X_n), \qquad r_{n+1} \stackrel{\mathrm{def}}{=} P_{\theta_n} \hat H_{\theta_n}(X_n) - P_{\theta_n} \hat H_{\theta_n}(X_{n+1}). \]

The result follows from [10, Theorem 1.1]. We check below the various conditions necessary
to apply this theorem, and finally establish the expression of the limiting variance.
Condition C1. The vector $\theta_\star$ is a zero of the mean field $h$ in view of (52), and $h$ is twice
continuously differentiable in a neighborhood of $\theta_\star$ under A1. From (52), it is easily checked
that $\nabla h(\theta_\star) = -d^{-1}\, \mathrm{Id}$, so that $\nabla h(\theta_\star)$ is a Hurwitz matrix. This gives condition C1 of [10,
Theorem 1.1].

Condition C2. By definition, $\{e_n, n \ge 0\}$ is a martingale increment and by Lemma 5.9 it
is bounded, so that conditions C2a and C2b of [10, Theorem 1.1] follow. We now consider
C2c. A simple computation shows that $\mathbb{E}[e_{k+1} e_{k+1}^T \mid \mathcal{F}_k] = \Xi(X_k,\theta_k)$ with



E(x,0) 



dcf 



Pe(x,dy) H 6 {y)H {yf 



P e (x,dy)H e (y)) / P e (x,dy)H e (y) 



We introduce the function $\hat\Xi_\theta$ solution of the Poisson equation

\[ \hat\Xi_\theta(x) - P_\theta \hat\Xi_\theta(x) = \Xi(x,\theta_\star) - \int \Xi(x,\theta_\star)\, \pi_\theta(x)\, \lambda(dx). \]



Since swp x \E(x,6+)\ < swp g x Hq(x) < oo (see ([56]) ). by Proposition 13.11 and [26] Sec- 



tion 17.4.1], such a function exists and sup^g, xS x E e( x ) 
with x replaced by X k and 9 replaced by k —i, we obtain 



< oo. Using the previous equality 
H(X fe A)- J 3(*A) Tr e Ax)X(dx) = {E(X k , 9 k ) - E(X k , 9,)} 



+ (^j E(x,6+) Tr ek _ 1 (x)\(dx) - J H(xA) ir 6it (x)\(dx) 
+ (3e fe _ 1 (X fc ) - Pe k E ek (X k )) + [P 6k Ee k (X k ) - P^Eq^X^) . 



The terms on the right-hand side should be small. This motivates the following
decomposition: $\mathbb{E}[e_{k+1} e_{k+1}^T \mid \mathcal{F}_k] = U_\star + D_k^{(1)} + D_k^{(2)}$ with

t+ = [ 3(*A) 7r e ^x)\(dx) , D® d ^ f Eg^iXk) - Pe k E 9k (X k ), 



and 



D 



(i) 



(spT fc A) - 3(X fe A)) + / 3(*A) (^(x) - tt^(x)) A(di) 



+ ^ h S„ k (X fc )-P (?k _ 1 3 flh _ 1 (X fc ) 
Denoting by T = {lim g 9 q = 9*}, we first prove that 



(57) 



lim 7 n E 

n— >+oo 



. 



(58) 



E^ 2) 

fc=i 

To this end, we decompose this sum as 

n n n 

E-°i 2) = ^2 { E 6 k -i( x k) ~ Pok-^ek-tiXk-i)} + E { p 9k-i E 8 k -i( x k-i) - Pe k Z 9k (X k )} 

k=l k=l 

n 

'Y] { s 6»ib-i(^fe) - jPefe-iH^^Xfe-i) j + P 9o Ee a (X ) - P en E 9n (X n ) . 



k=l 



k=l 
Since $\sup_{\theta \in \Theta, x \in \mathsf{X}} |\hat\Xi_\theta(x)| < \infty$, the last two terms on the right-hand side of the above equality
are such that $\gamma_n\, \mathbb{E}\big[ |P_{\theta_0} \hat\Xi_{\theta_0}(X_0) - P_{\theta_n} \hat\Xi_{\theta_n}(X_n)| \big] \to 0$. The first term is the sum of martingale
increments: by [12, Theorem 2.10], there exists a constant $C$ such that



supE 
n 



n 

n 



< sup E 



1/2 



Since lim n 7 n y / n = 0, this concludes the proof of (|58j) . We now prove that on the set T, 



D 



(!) ±b. 



. 



(59) 



We start with the first term in the definition (|57|) of Dji . Under AQ] there exist 77 > and a 
random variable iV, almost surely finite, such that on the set T, 



inf mf{9 n (i) A OJi)} > rj a.s. 

n>N i 



(60) 



Since sup 0ee :ceX 
x £ X, 8,9' £ e, 



Hq(x) < 00 (see Lemma l5.9p . there exists a constant C such that for any 

(61) 

By (i60l) , Lemmas 15.61 and 15.71 there exists a constant C such that 



H(x,0')| <Csup H d {y)-H e ,(y) 



sup |S(x, fc ) - H(x, 0*)| l r < C \9 k - 0*|l r , 

and the right-hand side converges to zero almost surely. For the second term in the defini- 
tion (|57p of Dip, we use Lemma |5.6[ (|60p and the bound sup eg Q xe x $)l < 00 to obtain 
the existence of a constant C such that, for any k > N, 



E(x, 9*) {ir ek (x) - TT di (x)} X(dx) 



lr < sup \E(x, 9) 
< C\8 k -8*\1 T . 



\ir$ k d\ - 7r^dA||Tvlr 



The right-hand side converges to zero almost surely. We turn finally to the third term in the 
definition (I57p of - Following the same lines as in the proof of Lemma [5. 91 it can be proved 
with the help of (|60p -([6Tj) and Lemmas 15.61 and 15.71 that there exists a constant C such that 
for any k > N, 



sup 



lr<C|0 fc -0 fc _i|l r 



and the right-hand side converges to zero almost surely. This concludes the proof of (59) and
the proof of the condition C2c of [10, Theorem 1.1].
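Condition C3. We write $r_{n+1} = r^{(1)}_{n+1} + r^{(2)}_{n+1}$ with

\[ r^{(1)}_{n+1} \stackrel{\mathrm{def}}{=} P_{\theta_{n+1}} \hat H_{\theta_{n+1}}(X_{n+1}) - P_{\theta_n} \hat H_{\theta_n}(X_{n+1}), \qquad r^{(2)}_{n+1} \stackrel{\mathrm{def}}{=} P_{\theta_n} \hat H_{\theta_n}(X_n) - P_{\theta_{n+1}} \hat H_{\theta_{n+1}}(X_{n+1}). \]

By Lemma 5.9 and (60), there exists a random variable $X$, almost surely finite, such that
$|r^{(1)}_{n+1}| \le X\, |\theta_{n+1} - \theta_n| \le X \gamma_{n+1}$. Moreover,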

H e (x)\ , 

where the supremum in the right-hand side is finite by Lemma 5.9. This concludes the proof
of condition C3.

Condition C4. This condition is precisely assumptions A3b), c) and A4.

Limiting variance. In case (i) of assumption A4, the limiting variance $\Sigma$ solves the equation
$\Sigma \nabla h(\theta_\star)^T + \nabla h(\theta_\star) \Sigma = -U_\star$. Since $\nabla h(\theta_\star) = -d^{-1}\, \mathrm{Id}$, it holds $\Sigma = (d/2)\, U_\star$. In case (ii), the
limiting variance solves the equation $\Sigma(\mathrm{Id} + 2\gamma_\star \nabla h(\theta_\star)^T) + (\mathrm{Id} + 2\gamma_\star \nabla h(\theta_\star))\Sigma = -2\gamma_\star U_\star$, so
that $(d - 2\gamma_\star)\Sigma = -\gamma_\star d\, U_\star$.
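
Both matrix equations can be checked directly, since $\nabla h(\theta_\star)$ is a multiple of the identity. The following sketch uses plain nested lists; the matrix `U`, the dimension `d` and the value of `gamma_star` are hypothetical, and the two equations are taken in the form stated in the paragraph above.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def scaled(c, a):
    return [[c * x for x in row] for row in a]

def added(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def max_gap(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

d, gamma_star = 3, 2.0                       # hypothetical values
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
U = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]   # hypothetical symmetric matrix
A = scaled(-1.0 / d, identity)               # gradient of h at theta_star

# Case (i): Sigma A^T + A Sigma = -U with Sigma = (d/2) U
sigma_i = scaled(d / 2.0, U)
gap_i = max_gap(added(matmul(sigma_i, transpose(A)), matmul(A, sigma_i)), scaled(-1.0, U))

# Case (ii): Sigma (Id + 2 g A^T) + (Id + 2 g A) Sigma = -2 g U with (d - 2 g) Sigma = -g d U
B = added(identity, scaled(2.0 * gamma_star, A))
sigma_ii = scaled(-gamma_star * d / (d - 2.0 * gamma_star), U)
gap_ii = max_gap(added(matmul(sigma_ii, transpose(B)), matmul(B, sigma_ii)),
                 scaled(-2.0 * gamma_star, U))
print(gap_i, gap_ii)
```

Both residuals vanish up to rounding, which confirms that the closed forms solve the stated equations for these inputs.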

5.7 Proof of the results presented in Section 4.1

We use the following notation in the proofs presented in this section. For $i \neq j \in \{1,2,3\}$,
the time to go from $i$ to $j$ for the non-adaptive dynamics is denoted

\[ T_{i \to j} = \min\{ n : X_n = j \text{ starting from } X_0 = i \}. \qquad (62) \]

A similar definition holds for the time $\mathcal{T}_{i \to j}$ to go from $i$ to $j$ for the adaptive dynamics.

5.7.1 Proof of Lemma 4.1

Using the Markov property and decomposing a trajectory from state 1 to state 3 as successive 
attempts from 1 to 2 back to 1, and eventually a successful transition from 1 to 2 up to 3, it 
is easy to check that: 

TV 

T^3 = E(^->2+^\ { i,3}) (63) 

71=1 

where 

N ~ geo , T^ 2 ~ geo (^j , ^{1,3} ~ 

are independent geometric random variables. The random variable $N$ represents the number of
excursions in state 2 before state 3 is eventually visited (we call excursion in state 2 a maximal
sequence of consecutive times when the chain lies in this state). The random variables $T^n_{1 \to 2}$
(respectively $T^n_{2 \to \{1,3\}}$) are the $n$-th sojourn time in state 1 (respectively state 2). Notice that
we have used here the fact that starting from 2, the probability to go to state 1 is equal to
the probability to go to state 3: this is why $N$ is geometric with parameter 1/2.

From (63), it is easy to check that (26) holds. Indeed, using the fact that for independent
geometric random variables $A \sim \mathcal{G}eo(a)$ and $(B_k)_{k \ge 1} \sim \mathcal{G}eo(b)$ (the random variables $B_k$ being
i.i.d.),

\[ \sum_{k=1}^{A} B_k \sim \mathcal{G}eo(ab), \]

it is easily seen that $T_{1 \to 3} = N_1 + N_2$, where $N_1$ and $N_2$ are (non-independent) geometric
random variables:

Ni ~ £eo (|) , iV 2 ~ ^to Q 
Notice that, in the limit $\varepsilon \to 0$, we have the following convergences in law:

\[ \varepsilon N_1 \to \mathcal{E}(1/6), \qquad \varepsilon N_2 \to 0, \]

where $\mathcal{E}(1/6)$ denotes an exponential random variable with parameter 1/6. The result (27) is
then easily obtained.
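
The compounding property of geometric variables used above can be verified exactly on the probability mass functions. The sketch below assumes geometric laws supported on $\{1, 2, \ldots\}$; the parameter $\varepsilon/6$ used for the limit is an assumption chosen to match the stated $\mathcal{E}(1/6)$ limit, and all names are hypothetical.

```python
import math

def geom_pmf(p, n):
    # P(G = n) for a geometric variable on {1, 2, ...} with success probability p
    return p * (1.0 - p) ** (n - 1)

def compound_pmf(a, b, n):
    # P(B_1 + ... + B_A = n), with A ~ Geo(a) independent of the i.i.d. B_k ~ Geo(b):
    # condition on A = m and use the negative binomial law of B_1 + ... + B_m.
    return sum(
        geom_pmf(a, m) * math.comb(n - 1, m - 1) * b ** m * (1.0 - b) ** (n - m)
        for m in range(1, n + 1)
    )

a, b = 0.5, 0.2
gaps = [abs(compound_pmf(a, b, n) - geom_pmf(a * b, n)) for n in range(1, 31)]

# Tail of eps * N with N ~ Geo(eps / 6), compared with the exponential tail exp(-t / 6)
eps, t = 1e-4, 3.0
tail_geometric = (1.0 - eps / 6.0) ** round(t / eps)
tail_exponential = math.exp(-t / 6.0)
print(max(gaps), tail_geometric, tail_exponential)
```

The first comparison is an exact identity; the second one only illustrates the convergence in law as $\varepsilon \to 0$.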

5.7.2 Proof of Proposition 4.2

The heuristic to prove the result is the following. In the limit of small $\varepsilon$, to go from 1 to 3, a
typical path first needs to stay sufficiently long in 1, in order for a transition to 2 to be more
likely (when $\theta_n(1)$ becomes sufficiently large). Then, from 2, the time it takes to go to 3 is
small compared to the time spent to leave 1 for the first time. The aim of this proof is to
quantify that by: (i) showing that a transition from 1 to 2 in a well-chosen time is very likely
and then (ii) showing that once 2 is reached, the time it remains to go to 3 is small compared
to the first transition time from 1 to 2. The precise result is the following.

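Proposition 5.11. Consider the adaptive dynamics $(X_n,\theta_n)$ defined in (24). Let us assume
that $\alpha \in [1/2, 1]$ and that, if $\alpha = 1/2$, $\gamma_\star < 1$. Then,

\[ \lim_{\varepsilon \to 0} P\big( \mathcal{T}_{1 \to 3} \in (a(\varepsilon), b(\varepsilon)) \big) = 1, \qquad (64) \]

with

• for $\alpha \in [1/2, 1)$,

\[ b(\varepsilon) = C_b |\ln \varepsilon|^{1/(1-\alpha)}, \qquad a(\varepsilon) = \Big[ \frac{1-\alpha}{\gamma_\star} \big( |\ln \varepsilon| - \beta(\varepsilon) \big) \Big]^{1/(1-\alpha)}, \]

where $C_b$ is any constant such that

\[ C_b > \Big( \frac{1-\alpha}{\gamma_\star} \Big)^{1/(1-\alpha)} \]

and $\beta(\varepsilon)$ is any nonnegative function smaller than $|\ln(\varepsilon)|$ and such that

\[ \lim_{\varepsilon \to 0} \big( |\ln \varepsilon| - \beta(\varepsilon) \big)^{\alpha/(1-\alpha)} e^{-\beta(\varepsilon)} = 0; \qquad (65) \]

• for $\alpha = 1$,

\[ a(\varepsilon) = \varepsilon^{-1/(1+\gamma_\star)} f(\varepsilon), \qquad b(\varepsilon) = \varepsilon^{-1/(1+\gamma_\star)} g(\varepsilon), \]

for any two positive functions $f$ and $g$ such that

\[ \lim_{\varepsilon \to 0} f(\varepsilon) = 0, \qquad \lim_{\varepsilon \to 0} g(\varepsilon) = \infty. \]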

In the case $\alpha \in [1/2,1)$, an example of a simple admissible lower bound is $a(\varepsilon) =
C_a |\ln \varepsilon|^{1/(1-\alpha)}$ where $C_a$ is any constant such that $C_a < \big( \frac{1-\alpha}{\gamma_\star} \big)^{1/(1-\alpha)}$. This is the lower bound



stated in Proposition [OJ In this case, one should consider /3(e) = min f 1, (jt^ — C^~ a ) ) | hie| 
which indeed satisfies (1651). 



Before we perform the proof, let us first introduce some notation. A crucial role will be
played by the time the dynamics needs to reach 2 for the first time:

\[ \mathcal{T}^0_{1 \to 2} = \min\{ n : X_n = 2 \text{ starting from } X_0 = 1 \}. \qquad (66) \]

The probability to go from state 1 to state 2 in exactly $n$ moves is

\[ P(\mathcal{T}^0_{1 \to 2} = n) = p^0_{11} \cdots p^{n-2}_{11}\, p^{n-1}_{12}, \qquad (67) \]

with

\[ p^m_{11} = 1 - \frac{1}{3} (\varepsilon \Xi_m \wedge 1), \qquad p^m_{12} = 1 - p^m_{11} = \frac{1}{3} (\varepsilon \Xi_m \wedge 1), \]

where

\[ \Xi_m = \prod_{k=1}^{m} (1 + \gamma_k). \qquad (68) \]

The first $n - 1$ factors in (67) correspond to staying in state 1 (with the appropriate update
of the weights), and the last one corresponds to the transition from state 1 to state 2. An
important inequality, which will be used below, is $p^n_{11} \le p^m_{11}$ (and thus $p^n_{12} \ge p^m_{12}$) for $m \le n$:
when the system is stuck in state 1, as time goes, the probability to go to state 2 increases.
Estimates on the exit time $\mathcal{T}_{1 \to 3}$ are based on the following equality:

\[ \mathcal{T}_{1 \to 3} = \mathcal{T}^0_{1 \to 2} + \sum_{i=1}^{N_{2 \to 1}} \mathcal{T}^i_{1 \to 2} + N_2 \qquad (69) \]

where $N_2$ is the number of times the chain is in 2 before going to 3, $N_{2 \to 1}$ is the number of
jumps from 2 back to 1 before going to 3 and $\mathcal{T}^i_{1 \to 2}$ is the time it takes to leave 1 at the $i$-th
return to the state 1 from 2. Notice that

\[ N_{2 \to 1} \le N_2. \]

To make these quantities more precise, let us introduce the successive first passage times: for 
i € {l,...,jV 2 _n}, 

ToU = inf { n > t\^ 2 ,X n = l}, (70) 
with, by convention Tf_-. 2 = T®^ 2 an d, 



t{_# = inf {n > rj^, X n = 2} . 

L ' 

Let us first state a simple result concerning A^ 2 . 



Note that I?_> 2 = r^ 2 - r\^ x 
Lemma 5.12. The random variable $N_2$ is geometric with parameter 1/3: for all $n \ge 0$,

\[ P(N_2 > n) = \Big( \frac{2}{3} \Big)^n. \]

Proof. This result is based on the fact that before visiting the state 3, $\theta_n(3) = 1$ remains
unchanged while $\theta_n(2) \ge 1$. This means that for $n < \mathcal{T}_{1 \to 3}$,

where we have used the inequality e < 1. At each time the system is in state 2, it stays in 
state 2 or goes to state 1 at the next time with probability 2/3. This concludes the proof. □ 



Thus, in (69), the last term plays no role in the limit $\varepsilon \to 0$. We show below that this
is also true for the second term: the main role is played by $\mathcal{T}^0_{1 \to 2}$. This is why we first need
to precisely estimate the time $\mathcal{T}^0_{1 \to 2}$. This can be done for any $\alpha \in (0, 1]$ (and not only in
$[1/2, 1]$), and without any restriction on $\gamma_\star$.

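Lemma 5.13. Fix $\alpha \in (0, 1]$. Then,

\[ \lim_{\varepsilon \to 0} P\big( \mathcal{T}^0_{1 \to 2} \in (\tilde a(\varepsilon), \tilde b(\varepsilon)) \big) = 1, \qquad (71) \]

where

• if $\alpha \in (0, 1)$, $\tilde b(\varepsilon) = \tilde C_b |\ln \varepsilon|^{1/(1-\alpha)}$ where $\tilde C_b$ is any constant such that
$\tilde C_b > \big( \frac{1-\alpha}{\gamma_\star} \big)^{1/(1-\alpha)}$, and $\tilde a(\varepsilon) = \big( \frac{1-\alpha}{\gamma_\star} ( |\ln \varepsilon| - \beta(\varepsilon) ) \big)^{1/(1-\alpha)}$ for any non-negative function $\beta(\varepsilon)$ smaller
than $|\ln(\varepsilon)|$ and satisfying (65);

• if $\alpha = 1$, $\tilde a(\varepsilon) = f(\varepsilon)\, \varepsilon^{-1/(1+\gamma_\star)}$, $\tilde b(\varepsilon) = g(\varepsilon)\, \varepsilon^{-1/(1+\gamma_\star)}$, for any positive functions $f$ and
$g$ such that $\lim_{\varepsilon \to 0} f(\varepsilon) = 0$ and $\lim_{\varepsilon \to 0} g(\varepsilon) = \infty$.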

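The proof of Lemma 5.13 can be read in Section 5.7.3 for the case $\alpha < 1$ and Section 5.7.4
for the case $\alpha = 1$. As will become clear in the proofs, an important computation to guess
the correct scaling is to consider the typical time $n(\varepsilon)$ for which $P(\mathcal{T}^0_{1 \to 2} = n(\varepsilon))$ is of order 1.
Let us consider for example $n(\varepsilon)$ such that $P(\mathcal{T}^0_{1 \to 2} > n(\varepsilon)) \simeq 2/3$, which writes equivalently
$\prod_{k=1}^{n(\varepsilon)} \big( 1 - \frac{1}{3} (\varepsilon \Xi_k \wedge 1) \big) \simeq 2/3$. Using an expansion when $\varepsilon$ goes to 0, assuming that $\varepsilon \Xi_{n(\varepsilon)}$
goes to zero (this will be checked a posteriori), we obtain that $n(\varepsilon)$ satisfies $\sum_{k=1}^{n(\varepsilon)} \Xi_k \simeq \varepsilon^{-1}$.
We will thus consider as a guess for the right scaling of the time $\mathcal{T}^0_{1 \to 2}$

\[ n(\varepsilon) = \operatorname{argmin}\Big\{ n : \sum_{k=1}^{n} \Xi_k \ge \frac{1}{\varepsilon} \Big\}. \qquad (72) \]

Depending on the values of $\alpha$ and $\gamma_\star$ in (25), this yields various asymptotic behaviors for $n(\varepsilon)$.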
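
The heuristic above can be explored numerically. The sketch below is only an illustration under several assumptions: step-sizes $\gamma_k = \gamma_\star/k^\alpha$, staying probabilities $1 - \frac{1}{3}(\varepsilon \Xi_k \wedge 1)$ as in (67), and the guess $n(\varepsilon)$ taken as the first $n$ with $\sum_{k=1}^{n} \Xi_k \ge 1/\varepsilon$. Function names and parameter values are hypothetical.

```python
def xi_sequence(gamma_star, alpha, n_max):
    # xi[m] = prod_{k=1}^{m} (1 + gamma_star / k**alpha), with xi[0] = 1
    xi = [1.0]
    for k in range(1, n_max + 1):
        xi.append(xi[-1] * (1.0 + gamma_star / k ** alpha))
    return xi

def survival(eps, xi):
    # surv[n] = P(first exit from state 1 happens after time n)
    #         = prod_{k=0}^{n-1} (1 - min(eps * xi[k], 1) / 3)
    surv = [1.0]
    for x in xi:
        surv.append(surv[-1] * (1.0 - min(eps * x, 1.0) / 3.0))
    return surv

def guess_time(eps, xi):
    # First n >= 1 such that xi[1] + ... + xi[n] >= 1 / eps (None if not reached)
    total = 0
    for n in range(1, len(xi)):
        total += xi[n]
        if total >= 1.0 / eps:
            return n
    return None

xi = xi_sequence(gamma_star=1.0, alpha=0.5, n_max=400)
n_small = guess_time(1e-6, xi)
n_large = guess_time(1e-3, xi)
surv = survival(1e-6, xi)
print(n_large, n_small, surv[n_small])
```

Comparing the output with $\big(\frac{1-\alpha}{\gamma_\star}|\ln \varepsilon|\big)^{1/(1-\alpha)}$ for a few values of $\varepsilon$ gives a feeling for the scaling, without replacing the proof.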

Since $\mathcal{T}_{1 \to 3} \ge \mathcal{T}^0_{1 \to 2} + 1$, the lower bound in Proposition 5.11 immediately follows from
this lemma. The upper bound requires some more work. We choose $\tilde b(\varepsilon) < b(\varepsilon)$ satisfying the
assumptions of Lemma 5.13. Let us also introduce a function $\tilde a(\varepsilon)$ satisfying the assumptions
of Lemma 5.13 and a positive function $A(\varepsilon)$. Additional requirements on the functions $\tilde b$, $\tilde a$
and $A$ will be made precise below. Then, using (69),



S (TU 3 > 6(e)) < i (*(e),b(e)) ) + ¥ { N i > A ( e ), 

+ P(r 1 % 2 G (a(e),6(e)) , iV 2 < A(e), IU 3 > 6(e)) 
< H T L2 i (a(e)Mej) ) + W > A(e)) (73) 



+ P ^-2 e (a(e), 6(e)) , iV 2 < A(e), £ T i^2 > 6(e) - A(e) - 6(e) j . 

The first term in the right-hand side goes to zero as $\varepsilon \to 0$ by Lemma 5.13. Likewise, as
soon as $A(\varepsilon)$ tends to $\infty$ when $\varepsilon$ goes to zero, the second term goes to zero by Lemma 5.12.
Concerning the third term, the idea is the following: we would like to choose $\tilde a$, $\tilde b$ and $A$
such that, on the event $\mathcal{T}^0_{1 \to 2} \in (\tilde a(\varepsilon), \tilde b(\varepsilon))$ and $N_2 \le A(\varepsilon)$, the times $\mathcal{T}^i_{1 \to 2}$ can be simply
controlled using the fact that the state 1 has already been visited for a long time (namely
$\mathcal{T}^0_{1 \to 2} \ge \tilde a(\varepsilon)$ and therefore $\theta_{\mathcal{T}^0_{1 \to 2}}(1)$ is large) and the state 2 is not visited many times (this
corresponds to $N_2 \le A(\varepsilon)$ so that $\theta_n(2)$ remains small). The following lemma quantifies the
latter idea.

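Lemma 5.14. Let us assume that $A$ is a non-negative function such that $A(\varepsilon) = O(\tilde a(\varepsilon)^\alpha)$,
where $\tilde a$ is any positive function such that $\lim_{\varepsilon \to 0} \tilde a(\varepsilon) = +\infty$. Let $\nu_2(n)$ denote the number of
visits of state 2 up to time $n$ included. Then, there exists a constant $R > 0$ independent of $\varepsilon$
such that, for any $\varepsilon \in (0, 1)$, on the event $\{\mathcal{T}^0_{1 \to 2} \ge \tilde a(\varepsilon)\}$, for any $n$ such that $\nu_2(n) \le A(\varepsilon)$,

\[ \theta_n(2) \le R. \]

Proof. On the event $\{\mathcal{T}^0_{1 \to 2} \ge \tilde a(\varepsilon)\}$, for $n$ such that $\nu_2(n) \le A(\varepsilon)$, it holds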

ro( B )+A(e)l+l 



4(2) < (l + | 




k=[a{e)i+l 

ro(e)+A(e)l+l \ r«(e)+A(e)l+l \a{e)+A(e)]+l 

n - e e £ 

fc=|_a(e)J+l / fc=|a(e)J + l k=[a(e)\ + l 

\a(e)+A(e)]+l k f \a(e)+A(e)]+l 

< > — dx = — dx. 

Z ✓ /, . rpa I, , Sl ^.a 

fc=|_a(e)J+l 



fc-1 37 ■'KOJ 



When a = 1, the right-hand side is equal to 7* In f )' which gives the claimed 

result. For a 6 (0, 1), 

-1 

W / ~ 1 - a 1 WJ I V L a ( e )J 



ra(e)+A(e)l+l \ / / . i_ Q \ 

h n We) rif +1 -1 



fc=|a(e)J+l 



A(e) 

<2 7 * 



; a(e) a ' 

for $\varepsilon$ small enough since $A(\varepsilon)/\tilde a(\varepsilon) = o(1)$. This concludes the proof. □
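
The mechanism behind Lemma 5.14 is that a product of at most $A + 1$ factors $(1 + \gamma_\star/k^\alpha)$, all taken at indices $k > a$, stays bounded when $A$ is of order $a^\alpha$. The sketch below checks the elementary bound $\prod (1 + \gamma_\star/k^\alpha) \le \exp(\gamma_\star (A+1)/a^\alpha)$; the integer arguments and the names are hypothetical simplifications of the quantities $\tilde a(\varepsilon)$ and $A(\varepsilon)$.

```python
import math

def weight_growth(a, lam, gamma_star, alpha):
    # prod_{k=a+1}^{a+lam+1} (1 + gamma_star / k**alpha): growth of the weight of
    # state 2 over at most lam + 1 visits, all occurring after time a.
    prod = 1.0
    for k in range(a + 1, a + lam + 2):
        prod *= 1.0 + gamma_star / k ** alpha
    return prod

def growth_bound(a, lam, gamma_star, alpha):
    # Each factor is at most exp(gamma_star / a**alpha) and there are lam + 1 factors
    return math.exp(gamma_star * (lam + 1) / a ** alpha)

gamma_star, alpha = 1.0, 0.5
for a in (10 ** 4, 10 ** 6):
    lam = int(a ** alpha)          # lam of order a**alpha, as in the lemma
    print(a, weight_growth(a, lam, gamma_star, alpha), growth_bound(a, lam, gamma_star, alpha))
```

With $A$ proportional to $a^\alpha$, the bound stays of order $e^{\gamma_\star}$ uniformly in $a$, which is the role played by the constant $R$.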
We now need to distinguish between the two cases $\alpha < 1$ and $\alpha = 1$ to complete the proof
of Proposition 5.11. This is the content of the next two sections.

5.7.3 Proofs of the technical results in the case $\alpha \in (0, 1)$

Proof of Lemma 5.13. For $\alpha \in (0, 1)$, it holds, in the limit $n \to \infty$,

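\[ \ln\Big( \prod_{k=1}^{n} \Big( 1 + \frac{\gamma_\star}{k^\alpha} \Big) \Big) \sim \gamma_\star \sum_{k=1}^{n} k^{-\alpha} \sim \frac{\gamma_\star}{1-\alpha}\, n^{1-\alpha}. \qquad (74) \]

We therefore formally obtain (in view of (72)) that $n(\varepsilon)$ is of the order of $|\ln \varepsilon|^{1/(1-\alpha)}$,
which gives an idea of what could be the scaling for the time $\mathcal{T}^0_{1 \to 2}$.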
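
The equivalent (74) and the two-sided control of $\ln \Xi_n$ used in this section can be observed numerically. The sketch assumes $\gamma_k = \gamma_\star/k^\alpha$ and uses the elementary inequalities $x - x^2/2 \le \ln(1+x) \le x$ for $x \ge 0$; names and parameter values are hypothetical.

```python
import math

def log_xi(n, gamma_star, alpha):
    # ln(Xi_n) = sum_{k=1}^{n} ln(1 + gamma_star / k**alpha)
    return sum(math.log1p(gamma_star / k ** alpha) for k in range(1, n + 1))

def upper_bound(n, gamma_star, alpha):
    # gamma_star * sum k**(-alpha) <= gamma_star * n**(1 - alpha) / (1 - alpha)
    return gamma_star * n ** (1.0 - alpha) / (1.0 - alpha)

def lower_bound(n, gamma_star, alpha):
    # sum gamma_k - (1/2) sum gamma_k**2
    gammas = [gamma_star / k ** alpha for k in range(1, n + 1)]
    return sum(gammas) - 0.5 * sum(g * g for g in gammas)

n, gamma_star, alpha = 10 ** 5, 1.0, 0.5
value = log_xi(n, gamma_star, alpha)
print(lower_bound(n, gamma_star, alpha), value, upper_bound(n, gamma_star, alpha))
```

For $\alpha = 1/2$ the gap between the value and the upper bound grows like a multiple of $\ln n$, which is why a logarithmic correction appears in the lower bound for that case.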

/ \l/(l-a) 

Let us first start by the lower bound on T®_^ 2 • Let a be of the form a(e) = I (| In e\ — (3(e) ) 
for any non-negative function /3(e) smaller than | ln(e)| and satisfying (I65p . By (I67p . 

/bWJ \ We)J f \ r La(e)J 

In (P{2^ 2 > o(e)}) = In [] = £ In (l - - (sE k A 1) > £ ( e S fc A 1) 

\ fc=0 J k=0 ^ ' fc=0 

n L«(e)J 

> E s.. 

fc=0 

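where we have used that, by concavity of $\ln$, $\ln(1 - x) \ge -C_0 x$ for $x \in (0, 1/3)$ with $C_0 =
-3\ln(2/3) > 0$. Now, for $\alpha \in (0, 1)$,

\[ \ln(\Xi_n) \le \gamma_\star \sum_{k=1}^{n} k^{-\alpha} \le \gamma_\star \sum_{k=1}^{n} \int_{k-1}^{k} x^{-\alpha}\, dx = \frac{\gamma_\star}{1-\alpha}\, n^{1-\alpha}. \qquad (75) \]

We deduce that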

n r k+l 



k=0 k=0 v 7 fc=0 k v 7 

/ 7* i_ a \ , . (n + l) Q f n+1 _ a ( 7* 
exp x ) dx < / 7*x exp x ax 



o 



1 - a / 1* Jo \l- a 



1 /„ i 1 \Q ( 7* f_ , ^l-Q 



< — (n + 1)° exp (n + 1 

7* V 1 - a 



Hence, using the inequality $(x + y)^s \le x^s + y^s$ for any $(x, y) \in \mathbb{R}_+^2$ and $s \in (0, 1)$,
In (p{2?_ 2 > a(e)}) > (o(e) + l) a exp + l) 1 " 01 ) 



> -C ie a(e) Q exp ( -^a^) 1 "" 
1 — a 



where Ci is a constant independent of e. Therefore, 

\ a/(l— a) 

— ^(|ln £ |-/3( e ))J exp (|lne|- /3(e)) 

>-C 2 (|ln e |-/3(e)) Q /( 1 - Q )exp(-/3( £ )), 
where $C_2$ is a constant independent of $\varepsilon$. Thus, under the assumption (65), we indeed obtain
that $\lim_{\varepsilon \to 0} P(\mathcal{T}^0_{1 \to 2} \le \tilde a(\varepsilon)) = 0$.

We now turn to an estimate of an upper bound for $\mathcal{T}^0_{1 \to 2}$. Let us introduce a function

l/(l-a) 



6(e) = CV| lnel 1 ^ 1 a ) where is any constant such that CV > ( 



an intermediate time n(e) < 6(e) such that = V^> which equivalently writes 

oNe) Al ) = o- 



We also define 



We choose 



n 



(e) = [CI lnel 1 ^ 1 "^' 



1 — a 



l/(l-a) 



<C<Cr 



In view of (|74p . and since 1 < 3^7- , it holds H n > exp(C a 1 n 1 a ) for n large enough. 
Thus, for e small enough, we obtain 



H fi(£) >exp(C' Q - 1 [Cllnel 1 /^ 



-1 l-Q 



> 



so that = 1/3. An upper bound on T®^ 2 i s then obtained as (notice that for e small 



enough, 



6(e) - 1 > n(e)): 



•CZf-a > 5(e)) = II 



k=0 



Pn< 



In- 



[6( £ )J -1 

n as 

fe=n(e) 



2X [b(s)\ -l-fi(s) 

3. 



(c'-q)|lne| 1 /(i-«) 



The right-hand side goes to zero when $\varepsilon$ goes to 0, which yields the result for the asymptotic
upper bound $\tilde b(\varepsilon)$. This ends the proof of Lemma 5.13 in the case $\alpha \in (0, 1)$.

Proof of Proposition 5.11. Let us assume $\alpha \in [1/2,1)$. We recall that the lower bound
follows from Lemma 5.13 and the inequality $\mathcal{T}_{1 \to 3} \ge \mathcal{T}^0_{1 \to 2} + 1$. We also recall the key
estimate (73) to prove the upper bound:

P(1U 3 > 6(e)) < P(r 1 % 2 i (a(e), 1 + P(iV 

IV2- 



+ P 1^2 e U(e),6(e) , iV 2 < A(e), £ T^ 2 > 6(e) - A(e) - 6(e) . 



(76) 



i=l 



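We choose $b(\varepsilon) = C_b\,|\ln\varepsilon|^{1/(1-\alpha)}$, $\tilde b(\varepsilon) = C_{\tilde b}\,|\ln\varepsilon|^{1/(1-\alpha)}$ with $C_b > C_{\tilde b} > \left(\frac{1-\alpha}{\gamma_\star}\right)^{1/(1-\alpha)}$ and

\[
a(\varepsilon) = \left(\frac{1-\alpha}{\gamma_\star}\,(|\ln\varepsilon| - \beta(\varepsilon))\right)^{1/(1-\alpha)}
\]

with $\beta(\varepsilon)$ smaller than $|\ln(\varepsilon)|$ and such that

\[
\lim_{\varepsilon\to0}\ (|\ln\varepsilon| - \beta(\varepsilon))^{\alpha/(1-\alpha)} \exp(-\beta(\varepsilon)) = 0,
\tag{77}
\]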



so that the first term in the right-hand side of (76) tends to 0 as $\varepsilon \to 0$ by Lemma 5.13. By
Lemma 5.12 the second term tends to 0 as soon as $A(\varepsilon) \to +\infty$. The next lemma, the proof
of which is postponed, is devoted to the third term.



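Lemma 5.15. Assume that $\alpha \in [1/2, 1)$ and $A(\varepsilon) = O(a^{\alpha}(\varepsilon))$ as $\varepsilon \to 0$. Then, there exists a
constant $C > 0$ such that

\[
\mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ \sum_{i=1}^{N_{2\to1}} T^i_{1\to2} > b(\varepsilon) - A(\varepsilon) - \tilde b(\varepsilon)\right)
\le
\begin{cases}
A(\varepsilon) \exp\left(-\dfrac{C}{A(\varepsilon)}\,|\ln\varepsilon|^{1/(1-\alpha)} \exp(-\beta(\varepsilon))\right) & \text{if } \alpha \in (1/2, 1), \\[2ex]
A(\varepsilon) \exp\left(-\dfrac{C}{A(\varepsilon)}\,|\ln\varepsilon|^{2-\gamma_\star^2} \exp(-\beta(\varepsilon))\right) & \text{if } \alpha = 1/2.
\end{cases}
\]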

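To complete the proof of Proposition 5.11, we now explain how to choose $\beta$ and $A$. The
function $A$ satisfies $A(\varepsilon) = O(a^{\alpha}(\varepsilon))$ and goes to infinity as slowly as needed. Then, we need
to find $\beta > 0$ such that $\beta(\varepsilon) < |\ln\varepsilon|$, (77) holds and, using the fact that $A$ can be chosen
going to infinity as slowly as needed,

\[
\begin{aligned}
&\text{if } \alpha \in (1/2, 1), && \lim_{\varepsilon\to0} |\ln\varepsilon|^{1/(1-\alpha)} \exp(-\beta(\varepsilon)) = +\infty, \\
&\text{if } \alpha = 1/2, && \lim_{\varepsilon\to0} |\ln\varepsilon|^{2-\gamma_\star^2} \exp(-\beta(\varepsilon)) = +\infty.
\end{aligned}
\]

This is indeed possible without restriction when $\alpha \in (1/2, 1)$ and if and only if $2 - \gamma_\star^2 > 1$
when $\alpha = 1/2$.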
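
To make the choice of $\beta$ concrete, here is one admissible family, given only as an illustration. It assumes that $\beta$ is taken logarithmic in $|\ln\varepsilon|$; the exponent $\kappa$ is a name introduced for this sketch and does not appear in the statements above.

```latex
% Illustrative choice (an assumption of this sketch):
%   \beta(\varepsilon) = \kappa \ln|\ln\varepsilon|, with \kappa > 0,
% so that \beta(\varepsilon) < |\ln\varepsilon| for \varepsilon small enough.
\[
(|\ln\varepsilon| - \beta(\varepsilon))^{\alpha/(1-\alpha)} e^{-\beta(\varepsilon)}
\sim |\ln\varepsilon|^{\frac{\alpha}{1-\alpha} - \kappa},
\qquad
|\ln\varepsilon|^{\frac{1}{1-\alpha}} e^{-\beta(\varepsilon)}
= |\ln\varepsilon|^{\frac{1}{1-\alpha} - \kappa}.
\]
% Case \alpha \in (1/2,1): the first quantity tends to 0 iff \kappa > \alpha/(1-\alpha),
% the second tends to +\infty iff \kappa < 1/(1-\alpha). Since
\[
\frac{1}{1-\alpha} - \frac{\alpha}{1-\alpha} = 1 > 0,
\]
% any \kappa in (\alpha/(1-\alpha), 1/(1-\alpha)) satisfies both requirements.
% Case \alpha = 1/2: the first quantity behaves like |\ln\varepsilon|^{1-\kappa}
% and the divergence requirement reads
\[
|\ln\varepsilon|^{2 - \gamma_\star^2} e^{-\beta(\varepsilon)}
= |\ln\varepsilon|^{2 - \gamma_\star^2 - \kappa} \to +\infty,
\]
% so that one needs 1 < \kappa < 2 - \gamma_\star^2, which is feasible
% within this family exactly when 2 - \gamma_\star^2 > 1.
```

Within this logarithmic family, the constraint $2 - \gamma_\star^2 > 1$ for $\alpha = 1/2$ appears as the condition for the interval of admissible $\kappa$ to be non-empty.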

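In order to prove Lemma 5.15, we need some lower bounds on $\Xi_n$.

Lemma 5.16. There exists a constant $C > 0$ independent of $n$ such that the following estimates hold. For $\alpha \in (1/2, 1)$,

\[
\Xi_n \ge C \exp\left(\frac{\gamma_\star}{1-\alpha}\, n^{1-\alpha}\right),
\]

while for $\alpha = 1/2$,

\[
\Xi_n \ge C \exp\left(2\gamma_\star \sqrt{n} - \frac{\gamma_\star^2}{2} \ln n\right).
\]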
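
As a numerical sanity check of growth estimates of this type, the following sketch compares $\ln \Xi_n$ with explicit upper and lower bounds obtained by comparing sums with integrals. It assumes that $\Xi_n = \prod_{k=1}^{n} (1 + \gamma_\star k^{-\alpha})$; the function names and the explicit constants in `lower_bound` are choices made for this illustration, not notation from the text.

```python
import math


def log_xi(n, gamma_star, alpha):
    """ln(Xi_n), assuming Xi_n = prod_{k=1}^n (1 + gamma_star / k**alpha)."""
    return sum(math.log1p(gamma_star / k**alpha) for k in range(1, n + 1))


def upper_bound(n, gamma_star, alpha):
    """Upper bound gamma_star / (1 - alpha) * n**(1 - alpha), for alpha in (0, 1)."""
    return gamma_star / (1.0 - alpha) * n ** (1.0 - alpha)


def lower_bound(n, gamma_star, alpha):
    """Explicit lower bound from ln(1 + x) >= x - x**2 / 2 and sum/integral comparison.

    Covers alpha = 1/2 and alpha in (1/2, 1) only.
    """
    if alpha == 0.5:
        return (2.0 * gamma_star * (math.sqrt(n) - 1.0)
                - gamma_star**2 / 2.0 * (1.0 + math.log(n)))
    return (gamma_star / (1.0 - alpha) * (n ** (1.0 - alpha) - 1.0)
            - gamma_star**2 * alpha / (2.0 * alpha - 1.0))


def bounds_hold(gamma_star, alpha, n_values=(1, 10, 100, 1000, 10000)):
    """True if lower_bound <= ln(Xi_n) <= upper_bound for every n in n_values."""
    return all(
        lower_bound(n, gamma_star, alpha)
        <= log_xi(n, gamma_star, alpha)
        <= upper_bound(n, gamma_star, alpha)
        for n in n_values
    )
```

For instance, `bounds_hold(1.0, 0.75)` checks the case $\alpha \in (1/2, 1)$ and `bounds_hold(2.0, 0.5)` the borderline case $\alpha = 1/2$, where the logarithmic correction appears in the lower bound.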



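Proof. We start from the lower bound

\[
\ln(\Xi_n) \ge \sum_{k=1}^{n} \gamma_k - \frac{1}{2} \sum_{k=1}^{n} \gamma_k^2.
\]

For $\alpha \in (0, 1)$,

\[
\sum_{k=1}^{n} \gamma_k \ge \gamma_\star \sum_{k=1}^{n} \int_{k}^{k+1} x^{-\alpha}\,dx
= \frac{\gamma_\star}{1-\alpha} \left((n+1)^{1-\alpha} - 1\right)
\ge \frac{\gamma_\star}{1-\alpha} \left(n^{1-\alpha} - 1\right),
\]

so that

\[
\ln(\Xi_n) \ge \frac{\gamma_\star}{1-\alpha} \left(n^{1-\alpha} - 1\right) - \frac{1}{2} \sum_{k=1}^{n} \gamma_k^2.
\]

We now distinguish between two cases. For $\alpha \in (1/2, 1)$,

\[
\sum_{k=1}^{n} \gamma_k^2 = \gamma_\star^2 \sum_{k=1}^{n} k^{-2\alpha}
\le \gamma_\star^2 + \gamma_\star^2 \sum_{k=2}^{n} \int_{k-1}^{k} x^{-2\alpha}\,dx
= \gamma_\star^2 + \frac{\gamma_\star^2}{2\alpha - 1} \left(1 - n^{1-2\alpha}\right).
\]

Therefore, for $n \ge 1$,

\[
\ln(\Xi_n) \ge \frac{\gamma_\star}{1-\alpha} \left(n^{1-\alpha} - 1\right) - \frac{\gamma_\star^2\,\alpha}{2\alpha - 1},
\]

which gives the expected result. For $\alpha = 1/2$,

\[
\sum_{k=1}^{n} \gamma_k^2 \le \gamma_\star^2 + \gamma_\star^2 \sum_{k=2}^{n} \int_{k-1}^{k} x^{-1}\,dx = \gamma_\star^2 \left(1 + \ln n\right),
\]

so that, for $n \ge 1$,

\[
\ln(\Xi_n) \ge 2\gamma_\star \left(\sqrt{n} - 1\right) - \frac{\gamma_\star^2}{2} \left(1 + \ln n\right),
\]

which also gives the claimed result. □

Proof of Lemma 5.15. Let

\[
c(\varepsilon) = \frac{b(\varepsilon) - \tilde b(\varepsilon)}{A(\varepsilon)} - 1.
\]

Using the fact that $N_{2\to1} \le N_2$, and denoting by $(\mathcal{F}_n)_{n\ge0}$ the filtration generated by $(X_n)_{n\ge0}$, it holds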



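\[
\begin{aligned}
&\mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ \sum_{i=1}^{N_{2\to1}} T^i_{1\to2} > b(\varepsilon) - A(\varepsilon) - \tilde b(\varepsilon)\right) \\
&\quad \le \mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ \exists i \in \{1, \ldots, N_{2\to1}\},\ T^i_{1\to2} > c(\varepsilon)\right) \\
&\quad \le \sum_{l=1}^{A(\varepsilon)} \mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ N_{2\to1} = l,\ \exists i \in \{1, \ldots, l\},\ T^i_{1\to2} > c(\varepsilon)\right) \\
&\quad \le \sum_{l=1}^{A(\varepsilon)} \sum_{i=1}^{l} \mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ N_{2\to1} = l,\ T^i_{1\to2} > c(\varepsilon)\right) \\
&\quad = \sum_{i=1}^{A(\varepsilon)} \mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ N_{2\to1} \ge i,\ T^i_{1\to2} > c(\varepsilon)\right) \\
&\quad \le \sum_{i=1}^{A(\varepsilon)} \mathbb{E}\left(\mathbf{1}_{\{T^0_{1\to2} > a(\varepsilon),\ N_{2\to1} \ge i\}}\ \mathbb{P}\left(N_2 \le A(\varepsilon),\ T^i_{1\to2} > c(\varepsilon)\ \middle|\ \mathcal{F}_{\tau^i_{2\to1}}\right)\right),
\end{aligned}
\tag{78}
\]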



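where $\tau^i_{2\to1}$ is defined by (70). We recall that $\nu_2(n)$ denotes the number of visits of state 2 up
to time $n$ included. On $\{N_{2\to1} \ge i\}$, $N_2 \ge \nu_2(\tau^i_{2\to1} + T^i_{1\to2})$ and therefore

\[
\begin{aligned}
&\mathbb{P}\left(N_2 \le A(\varepsilon),\ T^i_{1\to2} > c(\varepsilon)\ \middle|\ \mathcal{F}_{\tau^i_{2\to1}}\right) \\
&\quad \le \mathbb{E}\left(\mathbf{1}_{\{\nu_2(\tau^i_{2\to1} + c(\varepsilon) - 2) \le A(\varepsilon),\ X_{\tau^i_{2\to1}} = 1, \ldots, X_{\tau^i_{2\to1} + c(\varepsilon) - 2} = 1\}}\ \mathbb{P}\left(X_{\tau^i_{2\to1} + c(\varepsilon) - 1} = 1\ \middle|\ \mathcal{F}_{\tau^i_{2\to1} + c(\varepsilon) - 2}\right)\ \middle|\ \mathcal{F}_{\tau^i_{2\to1}}\right) \\
&\quad = \mathbb{E}\left(\mathbf{1}_{\{\nu_2(\tau^i_{2\to1} + c(\varepsilon) - 2) \le A(\varepsilon),\ X_{\tau^i_{2\to1}} = 1, \ldots, X_{\tau^i_{2\to1} + c(\varepsilon) - 2} = 1\}}\ \left(1 - P_{\theta_{\tau^i_{2\to1} + c(\varepsilon) - 2}}(1, 2)\right)\ \middle|\ \mathcal{F}_{\tau^i_{2\to1}}\right).
\end{aligned}
\tag{79}
\]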



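We recall that

\[
P_{\theta_n}(1, 2) = \frac{1}{3} \left(\varepsilon\, \frac{\theta_n(1)}{\theta_n(2)} \wedge 1\right).
\]

On the event $\{T^0_{1\to2} > a(\varepsilon)\}$, we have, for $n \ge a(\varepsilon)$, $\theta_n(1) \ge \Xi_{a(\varepsilon)}$, so that

\[
P_{\theta_n}(1, 2) = \frac{1}{3} \left(\varepsilon\, \frac{\theta_n(1)}{\theta_n(2)} \wedge 1\right) \ge \frac{\varepsilon\, \Xi_{a(\varepsilon)}}{3\,\theta_n(2)} \wedge \frac{1}{3}.
\]

Notice that $\theta_n(2) \ge 1$ and, from (75), $\varepsilon\, \Xi_{a(\varepsilon)} \le \exp(-\beta(\varepsilon))$ which goes to zero as $\varepsilon$ goes to
zero. Therefore, in this limit,

\[
P_{\theta_n}(1, 2) \ge \frac{\varepsilon\, \Xi_{a(\varepsilon)}}{3\,\theta_n(2)}.
\]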

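By Lemma 5.13, on the event $\{T^0_{1\to2} > a(\varepsilon)\} \cap \{\nu_2(\tau^i_{2\to1} + c(\varepsilon) - 2) \le A(\varepsilon)\}$, it holds
$\theta_{\tau^i_{2\to1} + c(\varepsilon) - 2}(2) \le R$ with $R$ independent of $\varepsilon$. With (79), we deduce that on $\{T^0_{1\to2} > a(\varepsilon)\} \cap \{N_{2\to1} \ge i\}$,

\[
\begin{aligned}
&\mathbb{P}\left(N_2 \le A(\varepsilon),\ T^i_{1\to2} > c(\varepsilon)\ \middle|\ \mathcal{F}_{\tau^i_{2\to1}}\right) \\
&\quad \le \left(1 - \frac{\varepsilon\, \Xi_{a(\varepsilon)}}{3R}\right) \mathbb{E}\left(\mathbf{1}_{\{\nu_2(\tau^i_{2\to1} + c(\varepsilon) - 3) \le A(\varepsilon),\ X_{\tau^i_{2\to1}} = 1, \ldots, X_{\tau^i_{2\to1} + c(\varepsilon) - 3} = 1\}}\ \left(1 - P_{\theta_{\tau^i_{2\to1} + c(\varepsilon) - 3}}(1, 2)\right)\ \middle|\ \mathcal{F}_{\tau^i_{2\to1}}\right).
\end{aligned}
\]

Iterating the reasoning, we obtain, on $\{T^0_{1\to2} > a(\varepsilon)\} \cap \{N_{2\to1} \ge i\}$,

\[
\mathbb{P}\left(N_2 \le A(\varepsilon),\ T^i_{1\to2} > c(\varepsilon)\ \middle|\ \mathcal{F}_{\tau^i_{2\to1}}\right) \le \left(1 - \frac{\varepsilon\, \Xi_{a(\varepsilon)}}{3R}\right)^{c(\varepsilon) - 1}.
\]

With (78), we deduce that

\[
\begin{aligned}
&\mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ \sum_{i=1}^{N_{2\to1}} T^i_{1\to2} > b(\varepsilon) - A(\varepsilon) - \tilde b(\varepsilon)\right) \\
&\quad \le A(\varepsilon) \exp\left((c(\varepsilon) - 1) \ln\left(1 - \frac{\varepsilon\, \Xi_{a(\varepsilon)}}{3R}\right)\right) \\
&\quad \le A(\varepsilon) \exp\left(-\frac{K}{A(\varepsilon)}\,|\ln\varepsilon|^{1/(1-\alpha)}\, \varepsilon\, \Xi_{a(\varepsilon)}\right),
\end{aligned}
\]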
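for some positive constant $K > 0$. We conclude by Lemma 5.16 which ensures

\[
\frac{K}{A(\varepsilon)}\,|\ln\varepsilon|^{1/(1-\alpha)}\, \varepsilon\, \Xi_{a(\varepsilon)} \ge \frac{C}{A(\varepsilon)}\,|\ln\varepsilon|^{1/(1-\alpha)} \exp(-\beta(\varepsilon))
\]

for $\alpha \in (1/2, 1)$ and

\[
\begin{aligned}
\frac{K}{A(\varepsilon)}\,|\ln\varepsilon|^{1/(1-\alpha)}\, \varepsilon\, \Xi_{a(\varepsilon)}
&\ge \frac{C}{A(\varepsilon)}\,|\ln\varepsilon|^{2}\, \varepsilon \exp\left(2\gamma_\star \sqrt{a(\varepsilon)} - \frac{\gamma_\star^2}{2} \ln(a(\varepsilon))\right) \\
&= \frac{C}{A(\varepsilon)}\,|\ln\varepsilon|^{2} \exp(-\beta(\varepsilon))\, \left(|\ln\varepsilon| - \beta(\varepsilon)\right)^{-\gamma_\star^2} \\
&\ge \frac{C}{A(\varepsilon)}\,|\ln\varepsilon|^{2-\gamma_\star^2} \exp(-\beta(\varepsilon))
\end{aligned}
\]

for $\alpha = 1/2$.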

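5.7.4 Proofs of the technical results for $\alpha = 1$

Proof of Lemma 5.13. In the case $\alpha = 1$, the asymptotic behavior of $\Xi_n$ is explicitly
known. Indeed, using the Stirling formula, we have (in the limit $n$ large),

\[
\Xi_n = \prod_{k=1}^{n} \left(1 + \frac{\gamma_\star}{k}\right) = \prod_{k=1}^{n} \frac{k + \gamma_\star}{k} \sim \frac{n^{\gamma_\star}}{\Gamma(1 + \gamma_\star)}.
\tag{80}
\]

Thus, in this case, we obtain with (72) that $n(\varepsilon) \sim \varepsilon^{-1/(1+\gamma_\star)}\, \Gamma(2 + \gamma_\star)^{1/(1+\gamma_\star)}$, which
motivates the scaling for the lower and upper bounds of $T^0_{1\to2}$.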
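
The Stirling-type asymptotics used here can be checked numerically. The sketch below assumes $\Xi_n = \prod_{k=1}^{n} (1 + \gamma_\star/k)$, for which the exact identity $\Xi_n = \Gamma(n+1+\gamma_\star) / (\Gamma(1+\gamma_\star)\,\Gamma(n+1))$ holds, and evaluates the ratio $\Xi_n\, \Gamma(1+\gamma_\star) / n^{\gamma_\star}$, which is expected to approach 1. The function names are introduced for this illustration only.

```python
import math


def log_xi_product(n, gamma_star):
    """ln(Xi_n) computed directly, assuming Xi_n = prod_{k=1}^n (1 + gamma_star / k)."""
    return sum(math.log1p(gamma_star / k) for k in range(1, n + 1))


def log_xi_closed_form(n, gamma_star):
    """ln(Xi_n) via Gamma functions: Gamma(n+1+g) / (Gamma(1+g) * Gamma(n+1))."""
    return (math.lgamma(n + 1 + gamma_star)
            - math.lgamma(1 + gamma_star)
            - math.lgamma(n + 1))


def stirling_ratio(n, gamma_star):
    """Xi_n * Gamma(1 + gamma_star) / n**gamma_star, expected to tend to 1."""
    log_ratio = (log_xi_closed_form(n, gamma_star)
                 + math.lgamma(1 + gamma_star)
                 - gamma_star * math.log(n))
    return math.exp(log_ratio)
```

Working with logarithms (`math.lgamma`, `math.log1p`) avoids overflow for large $n$; the direct product and the closed form should agree up to rounding errors.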

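For the lower bound, we choose $a(\varepsilon) = f(\varepsilon)\, \varepsilon^{-1/(1+\gamma_\star)}$ for any function $f$ such that
$\lim_{\varepsilon\to0} f(\varepsilon) = 0$. We have:

\[
\begin{aligned}
\mathbb{P}(T^0_{1\to2} \le a(\varepsilon))
&= 1 - \mathbb{P}(T^0_{1\to2} > \lfloor a(\varepsilon) \rfloor)
= 1 - \prod_{k=0}^{\lfloor a(\varepsilon) \rfloor - 1} p_k \\
&\le 1 - \left(p_{\lfloor a(\varepsilon) \rfloor}\right)^{\lfloor a(\varepsilon) \rfloor}
= 1 - \exp\left(\lfloor a(\varepsilon) \rfloor \ln\left(p_{\lfloor a(\varepsilon) \rfloor}\right)\right)
\le -\lfloor a(\varepsilon) \rfloor \ln\left(p_{\lfloor a(\varepsilon) \rfloor}\right) \\
&= -\lfloor a(\varepsilon) \rfloor \ln\left(1 - \frac{1}{3}\left(\varepsilon\, \Xi_{\lfloor a(\varepsilon) \rfloor} \wedge 1\right)\right)
\le -\lfloor a(\varepsilon) \rfloor \ln\left(1 - \frac{1}{3}\, \varepsilon\, \Xi_{\lfloor a(\varepsilon) \rfloor}\right) \\
&\le -\lfloor a(\varepsilon) \rfloor \ln\left(1 - C \varepsilon\, a(\varepsilon)^{\gamma_\star}\right)
\le -\lfloor a(\varepsilon) \rfloor \ln\left(1 - C \varepsilon^{1/(1+\gamma_\star)} f(\varepsilon)^{\gamma_\star}\right) \\
&\le C \lfloor a(\varepsilon) \rfloor\, \varepsilon^{1/(1+\gamma_\star)} f(\varepsilon)^{\gamma_\star}
\le C f(\varepsilon)^{1+\gamma_\star},
\end{aligned}
\]

which converges to 0 as $\varepsilon$ goes to 0.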

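We now consider the upper bound. We set $\tilde b(\varepsilon) = \tilde g(\varepsilon)\, \varepsilon^{-1/(1+\gamma_\star)}$ with $\lim_{\varepsilon\to0} \tilde g(\varepsilon) = \infty$.
In the following, we assume that $\tilde g$ grows sufficiently slowly so that $\lim_{\varepsilon\to0} \varepsilon\, (\tilde b(\varepsilon))^{\gamma_\star} = 0$. This
is not a restrictive assumption since the probability $\mathbb{P}(T^0_{1\to2} > \tilde b(\varepsilon))$ is even lower when the
function $\tilde g$ goes faster to infinity. Moreover, upon replacing $\tilde g(\varepsilon)$ by $\varepsilon^{1/(1+\gamma_\star)} \lfloor \varepsilon^{-1/(1+\gamma_\star)}\, \tilde g(\varepsilon) \rfloor$,
we may assume that $\tilde b : (0, 1) \to \mathbb{N}$. Using (80), it holds

\[
\mathbb{P}(T^0_{1\to2} > \tilde b(\varepsilon)) = \prod_{k=0}^{\tilde b(\varepsilon) - 1} p_k = \prod_{k=0}^{\tilde b(\varepsilon) - 1} \left(1 - \frac{1}{3}\left(\varepsilon\, \Xi_k \wedge 1\right)\right).
\]

For $k < \tilde b(\varepsilon)$, it holds $\varepsilon\, \Xi_k \le \varepsilon\, \Xi_{\tilde b(\varepsilon)} \sim C \varepsilon\, \tilde b(\varepsilon)^{\gamma_\star}$, which goes to zero as $\varepsilon$ goes to zero by
assumption. Thus, for $\varepsilon$ sufficiently small,

\[
\mathbb{P}(T^0_{1\to2} > \tilde b(\varepsilon)) = \prod_{k=0}^{\tilde b(\varepsilon) - 1} \left(1 - \frac{\varepsilon\, \Xi_k}{3}\right) \le \prod_{k=0}^{\tilde b(\varepsilon) - 1} \left(1 - C \varepsilon\, k^{\gamma_\star}\right),
\tag{81}
\]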

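where $C$ is a constant independent of $\varepsilon$. Then, using the fact that $\varepsilon\, \tilde b(\varepsilon)^{\gamma_\star}$ is smaller than $1/C$
for $\varepsilon$ sufficiently small, we have in this limit (in this series of equations, the positive constant
$C$ may change from line to line)

\[
\begin{aligned}
\ln\left(\prod_{k=0}^{\tilde b(\varepsilon) - 1} \left(1 - C \varepsilon\, k^{\gamma_\star}\right)\right)
&= \sum_{k=0}^{\tilde b(\varepsilon) - 1} \ln\left(1 - C \varepsilon\, k^{\gamma_\star}\right)
\le -C \varepsilon \sum_{k=1}^{\tilde b(\varepsilon) - 1} k^{\gamma_\star} \\
&\le -C \varepsilon \sum_{k=1}^{\tilde b(\varepsilon) - 1} \int_{k-1}^{k} x^{\gamma_\star}\,dx
= -C \varepsilon \int_{0}^{\tilde b(\varepsilon) - 1} x^{\gamma_\star}\,dx \\
&= -C \varepsilon\, (\tilde b(\varepsilon) - 1)^{\gamma_\star + 1}
\le -C\, \tilde g(\varepsilon)^{\gamma_\star + 1}.
\end{aligned}
\]

Using this estimate in (81) leads to $\mathbb{P}(T^0_{1\to2} > \tilde b(\varepsilon)) \le \exp\left(-C\, \tilde g(\varepsilon)^{\gamma_\star + 1}\right)$, the right-hand side
going to 0 as $\varepsilon \to 0$. This therefore concludes the proof of Lemma 5.13 in the case $\alpha = 1$.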
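
The scaling $\varepsilon^{-1/(1+\gamma_\star)}$ of the first exit time can be illustrated by evaluating the product formula directly. The sketch below is written under two assumptions: the survival probability is $\prod_{k=0}^{m-1} \left(1 - \frac{1}{3}(\varepsilon\, \Xi_k \wedge 1)\right)$ as in the displays above, and $\Xi_0 = 1$ (empty product). The function name `survival_probability` and the test values of $\varepsilon$ and $\gamma_\star$ are illustrative choices.

```python
import math


def survival_probability(m, eps, gamma_star):
    """Probability of not leaving state 1 during the first m steps, for alpha = 1.

    Evaluates prod_{k=0}^{m-1} (1 - min(eps * Xi_k, 1) / 3), where
    Xi_k = prod_{j=1}^k (1 + gamma_star / j) and Xi_0 = 1 (assumption of this sketch).
    """
    log_prob = 0.0
    xi = 1.0  # Xi_0
    for k in range(m):
        log_prob += math.log1p(-min(eps * xi, 1.0) / 3.0)
        xi *= 1.0 + gamma_star / (k + 1)  # update Xi_k -> Xi_{k+1}
    return math.exp(log_prob)


def exit_time_scale(eps, gamma_star):
    """Characteristic scale eps**(-1 / (1 + gamma_star)) of the exit time."""
    return eps ** (-1.0 / (1.0 + gamma_star))
```

With, say, `eps = 1e-6` and `gamma_star = 1.0`, the scale is about $10^3$: the survival probability is close to 1 at a tenth of this scale and negligible at ten times this scale, in line with the lower and upper bounds $a(\varepsilon)$ and $\tilde b(\varepsilon)$.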



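Proof of Proposition 5.11. Using again the fact that $T_{1\to3} \ge T^0_{1\to2} + 1$, we already have
an asymptotic lower bound on $T_{1\to3}$, the same as for $T^0_{1\to2}$. Namely, for $a(\varepsilon) = f(\varepsilon)\, \varepsilon^{-1/(1+\gamma_\star)}$,
where $f$ is any positive function such that $\lim_{\varepsilon\to0} f(\varepsilon) = 0$,

\[
\mathbb{P}(T_{1\to3} \le a(\varepsilon)) \le \mathbb{P}(T^0_{1\to2} \le a(\varepsilon)),
\]

and the right-hand side goes to zero when $\varepsilon$ goes to zero by Lemma 5.13.
For the upper bound, we set

\[
b(\varepsilon) = g(\varepsilon)\, \varepsilon^{-1/(1+\gamma_\star)}, \qquad \tilde b(\varepsilon) = \tilde g(\varepsilon)\, \varepsilon^{-1/(1+\gamma_\star)}
\]

for any positive functions $g$ and $\tilde g$ such that $\lim_{\varepsilon\to0} g(\varepsilon) = \lim_{\varepsilon\to0} \tilde g(\varepsilon) = \infty$ and $\tilde g < g$. As above
(see (73)), we write

\[
\begin{aligned}
\mathbb{P}(T_{1\to3} > b(\varepsilon))
&\le \mathbb{P}\left(T^0_{1\to2} \notin (a(\varepsilon), \tilde b(\varepsilon))\right) + \mathbb{P}(N_2 > A(\varepsilon)) \\
&\quad + \mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ \sum_{i=1}^{N_{2\to1}} T^i_{1\to2} > b(\varepsilon) - A(\varepsilon) - \tilde b(\varepsilon)\right).
\end{aligned}
\]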
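For any function $A$ which goes to infinity when $\varepsilon$ goes to zero, and given our choices for $a$ and
$\tilde b$, the first two terms in the right-hand side go to zero. For the last term, assuming from now
on that $f$ is such that $\lim_{\varepsilon\to0} a(\varepsilon) = +\infty$ and that $A(\varepsilon) = O(a(\varepsilon))$, we argue like in the proof
of Lemma 5.15 to obtain

\[
\begin{aligned}
&\mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ \sum_{i=1}^{N_{2\to1}} T^i_{1\to2} > b(\varepsilon) - A(\varepsilon) - \tilde b(\varepsilon)\right) \\
&\quad \le A(\varepsilon) \exp\left(\left(\frac{g(\varepsilon) - \tilde g(\varepsilon)}{A(\varepsilon)}\, \varepsilon^{-1/(1+\gamma_\star)} - 2\right) \ln\left(1 - \frac{\varepsilon\, \Xi_{a(\varepsilon)}}{3R}\right)\right).
\end{aligned}
\]

Using the fact that, by (80), $\varepsilon\, \Xi_{a(\varepsilon)}$ goes to zero when $\varepsilon$ goes to zero and that there exists a constant $C > 0$ independent of $\varepsilon$ such that

\[
\varepsilon\, \Xi_{a(\varepsilon)} \ge C \varepsilon\, a(\varepsilon)^{\gamma_\star} = C f(\varepsilon)^{\gamma_\star}\, \varepsilon^{1/(1+\gamma_\star)},
\]

we obtain (the constants $C$, $C'$ are
independent of $\varepsilon$, and their values may change from one occurrence to another)

\[
\begin{aligned}
&\mathbb{P}\left(T^0_{1\to2} \in (a(\varepsilon), \tilde b(\varepsilon)),\ N_2 \le A(\varepsilon),\ \sum_{i=1}^{N_{2\to1}} T^i_{1\to2} > b(\varepsilon) - A(\varepsilon) - \tilde b(\varepsilon)\right) \\
&\quad \le C' A(\varepsilon) \exp\left(-C\, \frac{g(\varepsilon) - \tilde g(\varepsilon)}{A(\varepsilon)}\, \varepsilon^{-1/(1+\gamma_\star)}\, \varepsilon\, \Xi_{a(\varepsilon)}\right) \\
&\quad \le C' \exp\left(-C\, \frac{g(\varepsilon) - \tilde g(\varepsilon)}{A(\varepsilon)}\, f(\varepsilon)^{\gamma_\star} + \ln(A(\varepsilon))\right).
\end{aligned}
\]

For a given $g$, it is always possible to choose $f$ and $\tilde g$ such that $(g(\varepsilon) - \tilde g(\varepsilon))\, f(\varepsilon)^{\gamma_\star}$ goes to
infinity as $\varepsilon$ goes to zero. Then, one can choose $A$ which grows sufficiently slowly at infinity
such that the right-hand side goes to zero. This ends the proof of Proposition 5.11 in the case
$\alpha = 1$. □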